Abstract

We consider stability and uniqueness in real phase retrieval problems
over general input sets. Specifically, we assume the data consists
of noisy quadratic measurements of an unknown input $x \in \mathbb{R}^n$ that lies
in a general set $T$ and study conditions under which $x$ can be stably
recovered from the measurements. In the noise-free setting we derive
a general expression on the number of measurements needed to ensure
that a unique solution can be found in a stable way, that depends on
the set $T$ through a natural complexity parameter. This parameter can
be computed explicitly for many sets $T$ of interest. For example, for
$k$-sparse inputs we show that $O(k\log(n/k))$ measurements are needed,
and when $x$ can be any vector in $\mathbb{R}^n$, $O(n)$ measurements suffice. In the
noisy case, we show that if one can find a value for which the empirical
risk is bounded by a given, computable constant (that depends on the
set $T$), then the error with respect to the true input is bounded above
by another, closely related complexity parameter of the set. By
choosing an appropriate number $N$ of measurements, this bound can
be made arbitrarily small, and it decays at a rate faster than $N^{-1/2+\delta}$
for any $\delta > 0$. In particular, for $k$-sparse vectors stable recovery is
possible from $O(k\log(n/k)\log k)$ noisy measurements, and when $x$ can
be any vector in $\mathbb{R}^n$, $O(n\log n)$ noisy measurements suffice. We also
show that the complexity parameter for the quadratic problem is the
same as the one used for analyzing stability in linear measurements
under very general conditions. Thus, no substantial price has to be
paid in terms of stability if there is no knowledge of the phase.

1 Introduction 

Recently there has been growing interest in recovering an input vector
$x \in \mathbb{R}^n$ from quadratic measurements

$$y_i = |\langle a_i, x\rangle|^2 + w_i, \qquad i = 1,\dots,N, \tag{1.1}$$

where $w_i$ is noise, and $a_i$ are a set of known vectors. Since only the
magnitude of $\langle a_i, x\rangle$ is measured, and not the phase (or the sign, in the
real case), this problem is referred to as phase retrieval. Phase retrieval
problems arise in many areas of optics, where the detector can only measure
the magnitude of the received optical wave. Several important applications
of phase retrieval include X-ray crystallography, transmission electron
microscopy and coherent diffractive imaging [39] [22] EU [47].

Many methods have been developed for phase recovery [22] which often 
rely on prior information about the signal, such as positivity or support 
constraints. One of the most popular techniques is based on alternating 
projections, where the current signal estimate is transformed back and forth 
between the object and the Fourier domains. The prior information and 
observations are used in each domain in order to form the next estimate. 
Two of the main approaches of this type are Gerchberg-Saxton [18] and
Fienup [17]. In general, these methods are not guaranteed to converge, and
often require careful parameter selection and sufficient prior information. 

To circumvent the difficulties associated with alternating projections,
more recently, phase retrieval problems have been treated using
semidefinite relaxation and low-rank matrix recovery ideas [6], [43]. In [6]
several masks were used in the measurement process in order to ensure the
ability to retrieve the phase. Another approach to generate robust solutions
is to assume that the input signal $x$ is sparse, namely, that it contains
only a few non-zero values in an appropriate basis expansion. Sparsity has
long been exploited in signal processing, applied mathematics, statistics and
computer science for tasks such as compression, denoising, model selection,
image processing and more. Despite the great interest in exploiting sparsity
in various applications, most of the work to date has focused on recovering
sparse or low rank data from linear measurements [15], [11], [5]. Recently,
the basic sparse recovery problem has been generalized to the case in which
the measurements are quadratic [33], or given by a more general nonlinear
transform of the unknown input [2]. The first paper to consider sparse
phase retrieval was [43], based on semidefinite relaxation combined with a
row-sparsity constraint on the resulting matrix. An iterative thresholding
algorithm was then proposed that approximates the solution. Similar
approaches were later used in [23], [37]. An alternative algorithm was recently
designed in [2], [32] using a greedy search method which is far more efficient
than the semidefinite relaxation, and often yields more accurate solutions.

Despite the vast interest in phase retrieval, there has been little the- 
oretical work on the fundamental limits of this problem. One important 
question in this context is how many measurements are needed in order to 
ensure robust recovery of the input x, regardless of the specific recovery 
method used. Several recent works treat this problem. Most of the papers
discuss the case in which $x$ is a general input, namely, there is no sparsity (or
other) constraint on $x$. The first result of this kind was obtained in [1], where
it is shown that with probability one $N = 4n - 2$ randomized equations are
sufficient for recovery using a brute force (intractable) method, when there
is no noise. However, it is not clear whether a stable recovery method
exists with this number of measurements. In [7], [8] the authors consider the
case in which $a_i$ are real or complex vectors that are either uniform on the
sphere of radius $\sqrt{n}$, or iid zero-mean Gaussian vectors with unit variance.
Under these assumptions they show that on the order of $n$ measurements
are needed in order to recover a generic $x$ using a semidefinite relaxation
approach. In the presence of noise, it is shown in [7] that one can find an
estimate $\hat{x}$ satisfying

$$\|\hat{x} - e^{i\varphi}x\|_2 \le C_0 \min\left(\|x\|_2, \frac{\|w\|_1}{N\|x\|_2}\right), \tag{1.2}$$

for some $\varphi$, where $C_0$ is a constant and $w$ is the noise vector that is assumed
to be bounded so that $\|w\|_1$ is finite.

The paper [29] treats the case in which the input $x$ is $k$-sparse and $a_i$
are iid zero-mean normal vectors. When there is no noise, they show that
in the real case $N \ge 4k - 1$ measurements are needed for uniqueness and
in the complex case, $N \ge 8k - 2$ measurements are required. They further
prove that if $N$ is on the order of $k^2\log n$ then the solution can be obtained
using a sparse semidefinite relaxation approach as in [43], [23].

Here we treat the real case and random measurements, using reasonable
ensembles. Our methods may be extended to the complex case (again, using
real, random measurements), but since the core of the problem is the real
case, we will restrict our analysis to it. For this setting, we develop
conditions leading to stable uniqueness, namely, conditions that ensure a unique
solution can be found in a stable way. We do not restrict ourselves to a
certain class of inputs, but rather allow for general input sets $T$ which include,
as special cases, $T = \mathbb{R}^n$ and the class of sparse vectors. It turns out that
to ensure stable uniqueness for a given set of signals, a very natural notion
of complexity of the set determines the number of data points required. For
example, we show that for $k$-sparse vectors, $O(k\log(n/k))$ measurements are
needed for stability. This result is better by a factor of $k$ than the estimate
from [29] that guarantees recovery (without noise) using a semidefinite
approach. When $x$ can be any vector in $\mathbb{R}^n$, we show that $O(n)$ measurements
suffice, which is also the bound derived in [7] for recovery using semidefinite
relaxation.

It turns out that the natural complexity parameter for this problem is 
the same as the one used for analyzing stability in linear measurements, as 
we will discuss in Section 5. Thus, in a rather general sense, the number of
measurements required for stable recovery in the quadratic setting we treat 
here is of the same order of magnitude as the one needed to ensure stability 
under linear sampling. In that sense there is no substantial price to be paid 
for not knowing the phase of the measurements, and for very general choices 
of input sets T. 

The second main result of this article deals with the noisy phase
retrieval problem. More specifically, we consider recovering an input $x$ in a
set $T$ from noisy measurements of the form (1.1). A straightforward
approach is to seek the value of $x$ that minimizes the empirical risk (or a
least-squares approach). Since this leads to a nonconvex problem, finding
its global solution is in general not possible. Nonetheless, we show that if
one can find a value $\hat{x}$ for which the empirical risk is bounded by a given,
computable constant (which depends on the set $T$), then $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2$
is bounded above by an expression that once again depends on the
complexity parameter of the set, and which converges to 0 faster than $N^{-1/2+\delta}$ for
any $\delta > 0$. Here $x_0$ is the true (unknown) input. The complexity parameter
that determines the rate in the noisy setting is essentially the same as in
the stability analysis. Moreover, the resulting sample complexity is of the
same order of magnitude as in the linear case in the examples of sets $T$ we
consider. An exact formulation of both main results is presented in the next
section.

An important practical conclusion from this analysis is that although
the squared-error in the case of nonlinear measurements as in (1.1) cannot
be minimized directly, it is sufficient to find a point for which the error is
bounded by a known constant. Thus, one may use any desired recovery
algorithm and check whether the solution $\hat{x}$ satisfies the bound. For this
purpose, methods such as those developed in [12] are advantageous since
they allow for arbitrary initial points. As different initializations lead to
different choices of $\hat{x}$, the algorithm can be used several times until an
appropriate value of $\hat{x}$ is found. Our theoretical results ensure that such an
$\hat{x}$ is sufficiently close to $x_0$ or to $-x_0$ if enough measurements are used.
In particular, for $k$-sparse vectors one can guarantee stable recovery from
$O(k\log(n/k)\log k)$ noisy measurements, and when $x$ can be any vector in
$\mathbb{R}^n$, $O(n\log n)$ noisy measurements suffice.

The remainder of the article is organized as follows. The problem and
the main results are formulated in Section 2. Stability results in the
noise-free setting are developed in Section 3, while the noisy setting is treated in
Section 4. In Section 5 the relation between the results in the quadratic case
and those in the linear setting is discussed.

Throughout the article we use the following notation. The statistical
expectation is denoted by $\mathbb{E}$, and if the probability space is a product space
$(\Omega \times \Omega', \mu \otimes \mu')$, $\mathbb{E}_\mu$ and $\mathbb{E}_{\mu'}$ are the conditional expectations. If $X$ is a
random variable, then $\|X\|_{L_p} = (\mathbb{E}|X|^p)^{1/p}$. The relation $a \sim b$ means that
$a$ is equal to $b$ up to absolute multiplicative constants, i.e., that there are
$c$ and $C$, independent of $a$, $b$ or any other parameters of the problem, for
which $ca \le b \le Ca$. The inequality $a \lesssim b$ means that $a \le Cb$ for some
constant $C$. We use $a \lesssim_{L,\gamma} b$ to denote the fact that the constant $C$ depends
only on $L$ and $\gamma$.

2 Problem Formulation and Main Results 

Suppose one is given measurements $y_i$ as in (1.1). Let $s$ be a vector in $\mathbb{R}^N$, set
$\phi(s)$ to be the length-$N$ vector with elements $|s_i|^2$ and put $Ax = (\langle a_i, x\rangle)_{i=1}^N$.
With this notation, (1.1) can be written as

$$y = \phi(Ax) + w. \tag{2.1}$$

Our goal is to study conditions under which stable recovery is possible
irrespective of the specific recovery method used, and to develop guarantees
that ensure that empirical minimization or approximate empirical
minimization (namely, least-squares recovery) leads to an estimate $\hat{x}$ that is close to
$x$ in a squared-error sense.
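The measurement model (2.1) can be simulated in a few lines. The following is only an illustrative sketch: the function name `measure`, the sizes `n` and `N`, the fixed seed and the Gaussian choice of the vectors $a_i$ are assumptions made for the example. It also shows that $x$ and $-x$ produce identical measurements, so recovery can only be up to a global sign.

```python
import random

def measure(A, x, noise=None):
    """Return y_i = <a_i, x>^2 + w_i for each row a_i of A (real case)."""
    y = [sum(a_j * x_j for a_j, x_j in zip(a, x)) ** 2 for a in A]
    if noise is not None:
        y = [y_i + w_i for y_i, w_i in zip(y, noise)]
    return y

rng = random.Random(0)   # fixed seed, used only for reproducibility
n, N = 5, 12             # illustrative sizes (assumed, not taken from the text)
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
neg_x = [-x_j for x_j in x]

y_plus = measure(A, x)       # phi(Ax)
y_minus = measure(A, neg_x)  # phi(A(-x)), identical to phi(Ax)
```

Passing a list of noise values as `noise` gives the noisy model; with `noise=None` the sketch reproduces the noise-free case $w = 0$.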
2.1 Assumptions on x and a

We assume throughout that $x$ lies in a subset $T$ of $\mathbb{R}^n$, which can be arbitrary.
It is natural to expect that the number of measurements needed for stable
recovery or for noisy recovery depends on the set $T$, though the way in which
it depends on $T$ is not obvious. Here, we prove that this number is a function
of a natural complexity parameter of $T$, in an estimate that is sharp in the
stable recovery problem, and sharp up to logarithmic factors in the noisy
recovery problem.

The assumption on the measurement vectors $a_i$ is that they are
independent, and distributed according to a probability measure $\mu$ on $\mathbb{R}^n$ that
is isotropic and $L$-subgaussian [9] 3HJ 0]:

Definition 2.1 Let $\mu$ be a probability measure on $\mathbb{R}^n$ and let $a$ be distributed
according to $\mu$. The measure $\mu$ is isotropic if for every $t \in \mathbb{R}^n$, $\mathbb{E}|\langle a,t\rangle|^2 =
\|t\|_2^2$. It is $L$-subgaussian if for every $t \in \mathbb{R}^n$ and every $u \ge 1$,
$\Pr(|\langle a,t\rangle| \ge Lu\|\langle t,a\rangle\|_{L_2}) \le 2\exp(-u^2/2)$.

Among the examples of isotropic, $L$-subgaussian measures on $\mathbb{R}^n$ for a
constant $L$ that is independent of the dimension $n$ are the standard Gaussian
measure, the uniform measure on $\{-1,1\}^n$ and the volume measure on the
"correct" multiple of the unit ball of $\ell_p^n$ for $2 \le p \le \infty$ (that is, the volume
measure on $c_n n^{1/p} B_p^n$, where $c_n \sim 1$). Also, if $X$ is a mean-zero, variance 1
random variable that satisfies $\Pr(|X| \ge Lu) \le 2\exp(-u^2/2)$, then a vector
of iid copies of $X$ is isotropic and $cL$-subgaussian, for a suitable absolute
constant $c$.
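For a small dimension, the isotropy of the uniform measure on $\{-1,1\}^n$ can be checked exactly by enumerating all sign patterns. The sketch below is only an illustration; the function name and the test vector `t` are arbitrary choices made for the example.

```python
from itertools import product

def second_moment_rademacher(t):
    """Exact E<a,t>^2 for a uniform on {-1,1}^n, by enumerating all 2^n sign patterns."""
    n = len(t)
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += sum(s * t_j for s, t_j in zip(signs, t)) ** 2
    return total / 2 ** n

t = [0.5, -1.0, 2.0, 0.25]          # arbitrary test vector (illustrative)
lhs = second_moment_rademacher(t)   # E|<a,t>|^2
rhs = sum(t_j ** 2 for t_j in t)    # ||t||_2^2
```

The two quantities agree because the cross terms $t_i t_j \mathbb{E}a_i a_j$ vanish for $i \ne j$; enumeration is only practical for small $n$.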

More generally, we have the following: 

Definition 2.2 If $F$ is a class of functions on a probability space $(\Omega,\mu)$,
then it is $L$-subgaussian if for every $f,h \in F \cup \{0\}$ and every $t \ge 1$

$$\Pr\big(|f - h|(X) \ge tL\|f - h\|_{L_2}\big) \le 2\exp(-t^2/2),$$

where $X$ is distributed according to $\mu$.

It is standard to verify that if $\mu$ is an $L$-subgaussian measure on $\mathbb{R}^n$ then
every class of linear functionals on $\mathbb{R}^n$ is $L$-subgaussian (see, e.g. [9]).

Our goal is to study when the mapping $\phi(Ax)$ is both invertible and
stable: first, when $w = 0$ (the noise-free case) and second, in the presence of
noise.
2.2 Stability Results

We begin in Section 3 by treating the noise-free setting. Since one is given
only the absolute values of $Ax$, it is impossible to distinguish $x$ and $-x$.
Therefore, uniqueness will always be up to the sign of $x$. If $\phi(Ax)$ is an
invertible stable mapping, it is natural to expect that for any $s$ and $t$ for
which $s \ne t$ and $s \ne -t$, $\phi(As)$ is far enough from $\phi(At)$ in some sense; here
we consider the $\ell_1$ sense.

Definition 2.3 The mapping $\phi(Ax)$ is stable with a constant $C$ in a set $T$
if for every $s,t \in T$,

$$\|\phi(At) - \phi(As)\|_1 \ge C\|s - t\|_2\|s + t\|_2. \tag{2.2}$$

Note that stability in a set is a much stronger property than invertibility.
Indeed, for the latter it suffices that if $s \ne \pm t$ then $\|\phi(At) - \phi(As)\|_1 > 0$,
but without any quantitative estimate on the difference.

The $\ell_1$ norm, used on the left-hand side, is the natural way of measuring
distances for the quadratic function $\phi$, if one wishes to compare the results
with the linear case, in which the $\ell_2$ distance is used (see Section 5 for
more information). Using the $\ell_1$ distance also has a technical advantage,
as it simplifies the analysis considerably. Measuring distances using other
$\ell_p$ norms leads to processes that are much harder to control, since higher
powers emphasize the "unbounded" or "peaky" parts of a random variable,
and make concentration around the mean much harder.

To formulate our stability result, let us define the main complexity
parameter required. For $T \subset \mathbb{R}^n$, define

$$T_- = \left\{\frac{t - s}{\|t - s\|_2} : t,s \in T,\ t \ne s\right\}, \qquad T_+ = \left\{\frac{t + s}{\|t + s\|_2} : t,s \in T,\ t \ne -s\right\}. \tag{2.3}$$

Let $(g_i)_{i=1}^n$ be independent Gaussian random variables, that have mean zero
and variance 1. Set

$$E = \max\left\{\mathbb{E}\sup_{v \in T_-}\sum_{i=1}^n g_i v_i,\ \mathbb{E}\sup_{w \in T_+}\sum_{i=1}^n g_i w_i\right\} \tag{2.4}$$

and put

$$\rho_{T,N} = \frac{E}{\sqrt{N}} + \frac{E^2}{N}. \tag{2.5}$$

Throughout this section, we will refer to $\rho_{T,N}$ as the complexity measure of
$T$.

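The main result in the noise-free case is the following:

Theorem 2.4 For every $L \ge 1$ there exist constants $c_1, c_2$ and $c_3$ that
depend only on $L$ for which the following holds. Let $\mu$ be an isotropic,
$L$-subgaussian measure. Then, for $u \ge c_1$, with probability at least
$1 - 2\exp(-c_2 u^2\min\{N, E^2\})$, for every $s,t \in T$,

$$\|\phi(As) - \phi(At)\|_1 \ge \|s - t\|_2\|s + t\|_2\left(\kappa(s - t, s + t) - c_3 u^3\rho_{T,N}\right),$$

where for every $v, w \in \mathbb{R}^n$,

$$\kappa(v,w) = \mathbb{E}\left|\left\langle a, \frac{v}{\|v\|_2}\right\rangle\left\langle a, \frac{w}{\|w\|_2}\right\rangle\right|. \tag{2.6}$$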

To put Theorem 2.4 in the right perspective, one has to obtain lower bounds
on $\kappa(s - t, s + t)$ and upper bounds on $\rho_{T,N}$. Since the latter depends on the
number of measurements $N$, its behavior provides insight into the number
of measurements that are needed for stability.
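To see how $\rho_{T,N}$ behaves as $N$ grows, one can estimate the Gaussian term $E$ by simulation. The sketch below treats the $k$-sparse case under an observation that is ours rather than the text's: for $k$-sparse $T$, the normalized sums and differences are the unit vectors with at most $2k$ non-zero entries, and the supremum of $\sum g_i v_i$ over such vectors equals the Euclidean norm of the $2k$ largest $|g_i|$. All names, sizes and the seed are illustrative assumptions.

```python
import math
import random

def top_m_norm(g, m):
    """Supremum of <g, v> over unit vectors v with at most m non-zero entries."""
    largest = sorted((g_i * g_i for g_i in g), reverse=True)[:m]
    return math.sqrt(sum(largest))

def estimate_E(n, m, trials=2000, seed=0):
    """Monte Carlo estimate of E sup <g, v> over unit m-sparse vectors v."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        acc += top_m_norm(g, m)
    return acc / trials

def rho(E, N):
    """Complexity measure E / sqrt(N) + E^2 / N."""
    return E / math.sqrt(N) + E * E / N

n, k = 200, 5                      # illustrative sizes
E_hat = estimate_E(n, 2 * k)       # supports of size up to 2k
rhos = [rho(E_hat, N) for N in (50, 200, 800, 3200)]
```

The list `rhos` decreases as $N$ increases, which is the mechanism by which enough measurements make the correction term in the stability bound small.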

The value of $\kappa(v,w)$ may be bounded using several methods, as we will
explain in Section 3.2.2. One natural example in which $\inf_{v,w \in S^{n-1}}\kappa(v,w)$
is bounded from below (where $S^{n-1}$ is the unit Euclidean sphere in $\mathbb{R}^n$), is
when $a$ satisfies a small-ball assumption, namely, that for every $t \in \mathbb{R}^n$ and
every $\varepsilon > 0$,

$$\Pr(|\langle a,t\rangle| \le \varepsilon\|t\|_2) \le c\varepsilon. \tag{2.7}$$

In that case $\inf_{v,w \in S^{n-1}}\kappa(v,w) \ge c_1$, where $c_1$
depends on the constant $c$ in (2.7). This assumption is satisfied for a large
family of measures, such as the Gaussian measure on $\mathbb{R}^n$ (see Section 3.2.2
for more details).

As for $\rho_{T,N}$, we will show, for example, that if $T$ is the set of $k$-sparse
vectors in $\mathbb{R}^n$, then $\rho_{T,N} \lesssim \sqrt{k\log(en/k)/N}$. Hence, under (2.7), since
$\inf_{v,w \in S^{n-1}}\kappa(v,w) \ge c_1$, it suffices to select $N$ large enough to ensure that
$c_2 u^3\rho_{T,N} \le c_1/2$ to obtain a stability result. This leads to the following
estimate:

Corollary 2.5 For every $L, c > 0$ there exist absolute constants $c_1, c_2, c_3$
and $c_4$ for which the following holds. Let $T$ be the set of $k$-sparse
vectors in $\mathbb{R}^n$, set $\mu$ to be an isotropic, $L$-subgaussian measure, and assume
that $a$ is distributed according to $\mu$. If $a$ satisfies (2.7) with constant $c$,

In particular, the result is true for a random Gaussian matrix $A$, where
$c_1, c_2, c_3$ and $c_4$ are absolute constants.

Interestingly, it can be shown that in the case of linear measurements,
stable recovery is guaranteed as long as $N \sim k\log(en/k)$. Thus, the number
of measurements needed for stable recovery in the nonlinear and linear
settings is the same up to multiplicative constants, at least for ensembles that
have a well behaved $\inf_{v,w \in S^{n-1}}\kappa(v,w)$. As mentioned in the introduction,
this observation is not a coincidence and will be explained in more detail in
Section 5.

In Section 3.2 we study other choices of $T$, and the number of
measurements needed in order to guarantee stability.

2.3 Noisy Recovery Results 

Section 4 is devoted to the case in which the measurements are contaminated
with iid noise. The goal is to find a point $\hat{x}$ for which $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2$ is
small, using the data $(a_i, y_i)_{i=1}^N$ and the fact that $y$ is generated according
to (1.1) for some $x_0 \in T$.

A natural approach is to recover $x_0$ from $y$ by minimizing the empirical
risk:

$$\min_x \mathcal{L}_x = \min_x \frac{1}{N}\sum_{i=1}^N \left|y_i - |\langle a_i, x\rangle|^2\right|^p, \tag{2.8}$$

for some $p > 1$. The objective in (2.8) is not convex, and therefore it is not
clear how to find its minimizer. Fortunately, we prove that in
order to find an estimate $\hat{x}$ close to $x_0$ one does not need to strictly minimize
(2.8). Instead, it is sufficient to find a point $\hat{x}$ for which the empirical risk
$\mathcal{L}_{\hat{x}}$ is small enough.
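The empirical risk in (2.8) is straightforward to evaluate for any candidate point. The following sketch is illustrative only: the function name `empirical_risk`, the value `p=1.5`, the sizes and the Gaussian measurement vectors are assumptions for the example, and the data are generated without noise so that the risk vanishes at $\pm x_0$.

```python
import random

def empirical_risk(A, y, x, p=1.5):
    """(1/N) * sum_i | y_i - <a_i, x>^2 |^p, the objective in (2.8)."""
    total = 0.0
    for a, y_i in zip(A, y):
        inner = sum(a_j * x_j for a_j, x_j in zip(a, x))
        total += abs(y_i - inner * inner) ** p
    return total / len(A)

rng = random.Random(1)   # fixed seed, used only for reproducibility
n, N = 4, 30             # illustrative sizes
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
x0 = [1.0, -2.0, 0.5, 0.0]
# Noise-free data y_i = <a_i, x0>^2.
y = [sum(a_j * x_j for a_j, x_j in zip(a, x0)) ** 2 for a in A]

risk_at_x0 = empirical_risk(A, y, x0)
risk_at_minus_x0 = empirical_risk(A, y, [-v for v in x0])
risk_at_zero = empirical_risk(A, y, [0.0] * n)
```

Because the risk depends on $x$ only through $|\langle a_i, x\rangle|^2$, it takes the same value at $x$ and $-x$, which is why closeness is measured through the product of the distances to $x_0$ and $-x_0$.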

Definition 2.6 Given a set $T \subset \mathbb{R}^n$, let

$$\ell(T) = \mathbb{E}\sup_{t \in T}\left|\sum_{i=1}^n g_i t_i\right| \tag{2.9}$$

be the Gaussian complexity of $T$, where $g_1,\dots,g_n$ are iid standard Gaussian
variables, and put

$$d(T) = \sup_{t \in T}\|t\|_2. \tag{2.10}$$

From the geometric point of view, $\ell(T)$ measures the best correlation
(or width) of $T$ in a random direction generated by the random vector
$G = (g_1,\dots,g_n)$. This parameter appears in many different areas of
mathematics, and is also essential in the study of compressed sensing problems
(see, for example [9]). We refer the reader to the books [28] 06l [38] ES] for
more information on this parameter and for methods of computing it. For
example, it is well known that $\ell(T)$ can be bounded from above and below
(with a possible $\sqrt{\log n}$ gap between the upper and lower bounds) using
the $\ell_2$ covering numbers of $T$.
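For a finite set $T$, both $\ell(T)$ and $d(T)$ can be approximated directly from their definitions. The sketch below is a plain Monte Carlo estimate and makes no claim of efficiency; the function names, the number of trials, the seed and the choice $T = \{e_1,\dots,e_n\}$ are assumptions made for the illustration.

```python
import math
import random

def d_of(T):
    """d(T) = sup over t in T of the Euclidean norm of t."""
    return max(math.sqrt(sum(c * c for c in t)) for t in T)

def ell_of(T, trials=2000, seed=0):
    """Monte Carlo estimate of ell(T) = E sup_{t in T} |sum_i g_i t_i|."""
    rng = random.Random(seed)
    n = len(T[0])
    acc = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        acc += max(abs(sum(g_i * t_i for g_i, t_i in zip(g, t))) for t in T)
    return acc / trials

n = 20
# T is the standard basis e_1, ..., e_n, so ell(T) = E max_i |g_i|.
T = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]
ell_hat = ell_of(T)
d_hat = d_of(T)
```

For this choice of $T$ the estimate stays below $\sqrt{2\log(2n)}$, the standard bound on the expected maximum of $2n$ standard Gaussian variables, whereas the full unit ball would give a value of order $\sqrt{n}$.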

Suppose that for a given 1 < p < 2, and u > 1 (which will later on 
govern our probability estimates), one produces x satisfying 

, N /III |P|| \ 

A \\(^,x)\ 2 - Vl \ p < EM* + u [Qt,n,w ~ ^7=^J • (2-11) 

Here IIM^Uq,, 1 < a < 2 is a measure of the decay properties of the noise, 
and will be defined formally in (|4.ip . Qt.n is a complexity measure similar 
to pr,N (defined formally in (|4.8|) ). and Qt,n,w = Qt,n + HM^I^i/v^- 
Our main result shows that, with high probability, such a point $\hat{x}$ is close to
either $x_0$ or to $-x_0$. To find an appropriate $\hat{x}$, it is possible, for example, to
use the greedy method of [42] with different starting points and stop once a
solution that satisfies the bound is found.
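The restart strategy described above amounts to a simple acceptance loop. The sketch below is schematic: `first_acceptable`, the toy data and the list `candidates` (standing in for the outputs of some solver started from different points) are all hypothetical, and the threshold is an arbitrary small number rather than the constant from (2.11).

```python
def first_acceptable(candidates, risk, threshold):
    """Return the first candidate whose empirical risk is at most `threshold`, else None."""
    for x_hat in candidates:
        if risk(x_hat) <= threshold:
            return x_hat
    return None

# Toy data: three noise-free measurements of x0 = (1, 2); all values are illustrative.
A = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
x0 = (1.0, 2.0)
y = [(a[0] * x0[0] + a[1] * x0[1]) ** 2 for a in A]

def risk(x, p=1.0):
    """Empirical risk (1/N) * sum_i | y_i - <a_i, x>^2 |^p on the toy data."""
    return sum(abs(y_i - (a[0] * x[0] + a[1] * x[1]) ** 2) ** p
               for a, y_i in zip(A, y)) / len(A)

# Stand-ins for the outputs of a solver run from different starting points.
candidates = [(2.0, 0.0), (1.0, -2.0), (-1.0, -2.0)]
accepted = first_acceptable(candidates, risk, threshold=1e-9)
```

Here the accepted point is $-x_0$, which is as good as $x_0$ itself since the two cannot be distinguished from the measurements.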

Theorem 2.7 For every $\kappa > 0$ and every $L \ge 1$ there exist constants
$c_1, c_2, c_3$ and $c_4$ that depend only on $L$ and $\kappa$, for which the following holds.
Let $a$ be distributed according to an isotropic, $L$-subgaussian measure, and
assume that $\kappa_T \ge \kappa$ where $\kappa_T = \inf_{s,t \in T}\kappa(s,t)$. Assume further that
$\|w\|_{\psi_2} < \infty$. For every integer $N$ set

$$\beta_N = \max\left\{c_1\left((\|w\|_{\psi_2}^2 + d^2(T))\log N + \ell^2(T)\right), e\right\}$$

and

$$p = 1 + 1/\log\beta_N.$$

Let $\hat{x}$ be chosen to satisfy (2.11). Then, for $u \ge c_1$, with probability at least
$1 - 2\exp(-c_3 n^{1/3})$,

$$\|x_0 - \hat{x}\|_2\|x_0 + \hat{x}\|_2 \le c_4(uQ_{T,N,w})^{1/p}\sqrt{\log\beta_N}.$$

Theorem 2.7 shows that stable recovery is possible if $N > Q_{T,N,w}$.
In particular, for $k$-sparse vectors, stable recovery is possible from $O(k\log(n/k)\log k)$
noisy measurements (this estimate is off only by a $\log k$ factor from the
optimal estimate in the linear case), and when $x_0$ can be any vector in $\mathbb{R}^n$,
$O(n\log n)$ noisy measurements suffice.
2.4 Technical Tool



The main technical tool needed in the proof of both main results is a general
estimate on properties of empirical processes indexed by $\{fh : f \in F, h \in
H\}$. Although the result is true in a far more general situation than needed
here, for the sake of simplicity we will present it only in the cases required.
We refer the reader to [31] for the more general statement and precise results.

In the cases considered here, $F$ and $H$ are classes of linear functionals
or of absolute values of linear functionals on $\mathbb{R}^n$, which is endowed with
an isotropic, $L$-subgaussian probability measure $\mu$. For the stability result
$F = \{|\langle t,\cdot\rangle| : t \in T_+\}$ and $H = \{|\langle t,\cdot\rangle| : t \in T_-\}$, while in the noisy case,
$F = \{\langle t - t_0,\cdot\rangle : t \in T\}$ and $H = \{\langle t + t_0,\cdot\rangle : t \in T\}$. In both scenarios, the
two indexing sets are denoted by $T_1, T_2 \subset \mathbb{R}^n$.

Theorem 2.8 [31] For every $L \ge 1$ there are constants $c_1, c_2, c_3$ and $c_4$
that depend only on $L$ and for which the following holds. Let $T_1, T_2 \subset \mathbb{R}^n$
be of cardinality at least 2 and set $F$ and $H$ to be the corresponding classes
as above, respectively. Assume without loss of generality that $\ell(T_1)/d(T_1) \ge
\ell(T_2)/d(T_2)$. Then, for every $u \ge c_1$, with probability at least

$$1 - 2\exp\left(-c_2 u^2\min\{N, (\ell(T_1)/d(T_1))^2\}\right),$$

$$\sup_{f \in F, h \in H}\left|\frac{1}{N}\sum_{i=1}^N f(a_i)h(a_i) - \mathbb{E}fh\right| \le c_3 u^3\left(d(T_2)\frac{\ell(T_1)}{\sqrt{N}} + \frac{\ell^2(T_1)}{N}\right).$$

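In particular, for every $q \ge 2$,

$$\left\|\sup_{f \in F, h \in H}\left|\frac{1}{N}\sum_{i=1}^N f(a_i)h(a_i) - \mathbb{E}fh\right|\right\|_{L_q} \le c_4 q^{3/2}\left(d(T_2)\frac{\ell(T_1)}{\sqrt{N}} + \frac{\ell^2(T_1)}{N}\right).$$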



3 Stability Results 

In this section we present the proof of Theorem 2.4, followed by estimates
on the values of $\kappa(v,w)$ and $\rho_{T,N}$ appearing in the theorem.



3.1 Proof of Theorem 2.4

Observe that

$$\|\phi(At) - \phi(As)\|_1 = \sum_{i=1}^N\left||\langle a_i,t\rangle|^2 - |\langle a_i,s\rangle|^2\right| = \sum_{i=1}^N|\langle a_i, s - t\rangle\langle a_i, s + t\rangle|. \tag{3.1}$$

Therefore, to establish the desired stability result, it suffices to show that

$$\inf_{\{s,t \in T,\ s \ne \pm t\}}\frac{\|\phi(At) - \phi(As)\|_1}{\|s - t\|_2\|s + t\|_2} \ge C,$$

i.e., that

$$\inf_{\{s,t \in T,\ s \ne \pm t\}} z_{t,s} \ge \frac{C}{N}, \tag{3.2}$$

where

$$z_{t,s} = \frac{1}{N}\sum_{i=1}^N\left|\left\langle\frac{s - t}{\|s - t\|_2}, a_i\right\rangle\left\langle\frac{s + t}{\|s + t\|_2}, a_i\right\rangle\right|. \tag{3.3}$$

Since $\kappa(s - t, s + t) = \mathbb{E}z_{t,s}$, if $\kappa(s - t, s + t)$ is very small, then a random
selection of $a_i$ is unlikely to lead to (3.2). Therefore, a reasonable
prerequisite for a stability result is that $\inf_{s \ne \pm t,\ s,t \in T}\kappa(s - t, s + t)$ is bounded
away from zero. Indeed, with this assumption, one may obtain a stability
result.
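The factorization behind (3.1), $|\langle a,t\rangle|^2 - |\langle a,s\rangle|^2 = -\langle a, s - t\rangle\langle a, s + t\rangle$, is easy to confirm numerically, together with the normalized average $z_{t,s}$ from (3.3). The sketch below uses Gaussian vectors and arbitrary sizes purely for illustration; none of these choices comes from the text.

```python
import random

rng = random.Random(2)   # fixed seed, used only for reproducibility
n, N = 6, 40             # illustrative sizes
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
s = [rng.gauss(0.0, 1.0) for _ in range(n)]
t = [rng.gauss(0.0, 1.0) for _ in range(n)]

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

diff = [s_i - t_i for s_i, t_i in zip(s, t)]   # s - t
summ = [s_i + t_i for s_i, t_i in zip(s, t)]   # s + t

# Left- and right-hand sides of (3.1).
lhs = sum(abs(dot(a, t) ** 2 - dot(a, s) ** 2) for a in A)
rhs = sum(abs(dot(a, diff) * dot(a, summ)) for a in A)

# z_{t,s} from (3.3): the same sum, divided by N and by the two norms.
z = rhs / (N * dot(diff, diff) ** 0.5 * dot(summ, summ) ** 0.5)
```

The value `z` is an empirical average of $N$ independent copies of a random variable whose mean is $\kappa(s - t, s + t)$, which is what the concentration argument in the proof exploits.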



Proposition 3.1 For every $L \ge 1$ there exist constants $c_1, c_2$ and $c_3$ that
depend only on $L$ for which the following holds. Let $\mu$ be an isotropic,
$L$-subgaussian measure on $\mathbb{R}^n$ and set $a$ to be a random vector distributed
according to $\mu$. Then, for $u \ge c_1$, with probability at least $1 - 2\exp(-c_2 u^2\min\{N, E^2\})$,
for every $s,t \in T$,

$$z_{s,t} \ge \left(\kappa(s - t, s + t) - c_3 u^3\rho_{T,N}\right)\|s - t\|_2\|s + t\|_2.$$

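Proof. Observe that

$$\sup_{s,t \in T}|z_{t,s} - \kappa(s - t, s + t)| = \sup_{v \in T_+,\ w \in T_-}\left|\frac{1}{N}\sum_{i=1}^N|\langle a_i,v\rangle\langle a_i,w\rangle| - \mathbb{E}|\langle a,v\rangle\langle a,w\rangle|\right|.$$

By Theorem 2.8 for $F = \{|\langle v,\cdot\rangle| : v \in T_+\}$ and $H = \{|\langle w,\cdot\rangle| : w \in T_-\}$,
it follows that if $N \ge c_1 E^2$ and $u \ge c_2$ then with probability at least
$1 - 2\exp(-c_3 u^2 E^2)$,

$$\sup_{v \in T_+,\ w \in T_-}\left|\frac{1}{N}\sum_{i=1}^N|\langle a_i,v\rangle\langle a_i,w\rangle| - \mathbb{E}|\langle a,v\rangle\langle a,w\rangle|\right| \le c_4 u^3\rho_{T,N}. \tag{3.4}$$

The claim now follows immediately from the definition of $z_{s,t}$ and of
$\kappa(s - t, s + t)$.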



3.2 Computing $\kappa$ and $\rho_{T,N}$



3.2.1 Bounding $\rho_{T,N}$

It is well known that if $T \subset \mathbb{R}^n$ then $\ell(T)$ (and therefore, $\rho_{T,N}$ as well) is
determined by the Euclidean metric structure of $T$. This is the outcome of
the celebrated majorizing measures/generic chaining theory (see the books
[28] HI US] for a detailed exposition on this topic). In the examples we
present here, the following estimate, which is, in general, suboptimal,
suffices.

Definition 3.2 Let $(T, d)$ be a compact metric space. For every $\varepsilon > 0$, let
$N(T, d, \varepsilon)$ be the smallest number of open balls of radius $\varepsilon$ needed to cover
$T$. The numbers $N(T, d, \varepsilon)$ are called the $\varepsilon$-covering numbers of $T$ relative
to the metric $d$.

Given $T \subset \mathbb{R}^n$, set $N(T,\varepsilon) = N(T, \|\cdot\|_2, \varepsilon)$, i.e., the covering numbers
relative to the Euclidean metric.

Proposition 3.3 There exist absolute constants $c$ and $C$ for which the
following holds. If $T \subset \mathbb{R}^n$ then

$$c\sup_{\varepsilon > 0}\varepsilon\sqrt{\log N(T,\varepsilon)} \le \ell(T) \le C\int_0^\infty\sqrt{\log N(T,\varepsilon)}\,d\varepsilon.$$

The upper bound is due to Dudley [13] and the lower to Sudakov [45]. The
proof of both bounds may be found, for example, in [28], [38], 112].

It is straightforward to verify that the gap between the upper and lower
bounds in Proposition 3.3 is at most $\sim\sqrt{\log n}$, and in all the examples we
study below, the resulting estimate is sharp.

3.2.2 Bounding $\kappa$

Here, we present two simple methods for bounding $\inf_{v,w \in S^{n-1}}\kappa(v,w)$ from
below. These methods are not the only possibilities by which one may obtain
such a bound; rather, they serve as an indication that the assumption on $\kappa$
is less restrictive than may appear at first glance.

Recall the small ball assumption: for every $t \in \mathbb{R}^n$ and every $\varepsilon > 0$,
$\Pr(|\langle a,t\rangle| \le \varepsilon\|t\|_2) \le c\varepsilon$.

Lemma 3.4 If $a$ satisfies the small ball assumption with constant $c$ then

$$\inf_{v,w \in S^{n-1}}\kappa(v,w) \ge \kappa,$$

where $\kappa$ depends only on $c$.

Proof. Consider $\varepsilon$ for which $c\varepsilon \le 1/4$. Then for every $v \in S^{n-1}$, there is
an event of measure at least 3/4 on which $|\langle v,a\rangle| > \varepsilon$. Hence, for two fixed
vectors $v,w \in S^{n-1}$,

$$\Pr\left(\{|\langle v,a\rangle| > \varepsilon\} \cap \{|\langle w,a\rangle| > \varepsilon\}\right) \ge 1/2,$$

and thus $\mathbb{E}|\langle v,a\rangle\langle w,a\rangle| \ge \varepsilon^2/2$. ■

This type of small ball property is true in many cases. The simplest
example is the standard Gaussian measure on $\mathbb{R}^n$. Indeed, if $a = (g_1,\dots,g_n)$
then $|\langle a,w\rangle|$ is distributed as $\|w\|_2|g|$ and the small ball property follows
immediately by applying the estimate on the density of $g$.
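In the Gaussian case the small ball probability has a closed form, $\Pr(|g| \le \varepsilon) = \mathrm{erf}(\varepsilon/\sqrt{2})$, and the density bound $1/\sqrt{2\pi}$ gives the constant $c = \sqrt{2/\pi}$. The short check below illustrates this for a few values of $\varepsilon$; the function name and the particular values are arbitrary choices for the example.

```python
import math

def gaussian_small_ball(eps):
    """Pr(|g| <= eps) for a standard Gaussian g, via the error function."""
    return math.erf(eps / math.sqrt(2.0))

# 2 * (maximal density of g) = 2 / sqrt(2 * pi) = sqrt(2 / pi).
c = math.sqrt(2.0 / math.pi)

# Triples (eps, exact probability, linear bound c * eps).
checks = [(eps, gaussian_small_ball(eps), c * eps)
          for eps in (0.01, 0.1, 0.5, 1.0, 3.0)]
```

For small $\varepsilon$ the probability and the bound nearly coincide, while for large $\varepsilon$ the bound exceeds 1 and carries no information.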

A more general example is based on the notion of log-concavity. A
measure $\mu$ on $\mathbb{R}^n$ is called log-concave if for every nonempty, Borel measurable
sets $A, B \subset \mathbb{R}^n$, and any $0 < \lambda < 1$, $\mu(\lambda A + (1 - \lambda)B) \ge \mu^\lambda(A)\mu^{1-\lambda}(B)$,
where $\lambda A + (1 - \lambda)B = \{\lambda a + (1 - \lambda)b : a \in A, b \in B\}$. It is well known that
$\mu$ is a log-concave measure if and only if it has a density of the form $\exp(\phi)$
for a concave function $\phi : \mathbb{R}^n \to \mathbb{R}$.

The following lemma is standard (see e.g. [19] 13"! [35]).

Lemma 3.5 There exists an absolute constant $c$ for which the following
holds. Let $a$ be distributed according to an isotropic, symmetric, log-concave
measure. Then for every $\theta \in S^{n-1}$, $\langle a,\theta\rangle$ is distributed according to an
isotropic, symmetric, log-concave measure on $\mathbb{R}$. Also, if $f_\theta$ is the density
of $\langle a,\theta\rangle$ then $\|f_\theta\|_\infty \le c$.

The desired small-ball estimate clearly follows from the lemma, since

$$\Pr(|\langle a,\theta\rangle| \le \varepsilon) = \int_{-\varepsilon}^{\varepsilon} f_\theta(t)\,dt \le 2c\varepsilon.$$

Among the family of log-concave measures are volume measures on convex 
symmetric bodies (i.e. measures that have a constant density on the body 
and zero outside the body). Moreover, it can be shown (see, e.g. [H]), that 
for every convex body K C R n there is an invertible linear operator G for 
which the volume measure on GK is also isotropic. 

Another example of log-concave measures on $\mathbb{R}^n$ are product measures
of log-concave measures on $\mathbb{R}$. If $X$ is a real valued, symmetric, log-concave
random variable (i.e. with a log-concave density) with variance one, and
$X_1,\dots,X_n$ are iid copies of $X$, then $a = (X_1,\dots,X_n)$ is distributed according
to an isotropic log-concave measure on $\mathbb{R}^n$. Standard examples for log-concave
measures on $\mathbb{R}$ are those with a density proportional to $\exp(-c_p|t|^p)$ for $p \ge 1$. And,
of course, any measure with density proportional to $\exp(-\phi)$ for a convex
function $\phi$ is log-concave.

The second method, which we only outline, is based on the
Paley-Zygmund argument.

Lemma 3.6 [10] Let $Z$ be a random variable, set $0 < p < q$ and put $c_{p,q} =
\|Z\|_{L_p}/\|Z\|_{L_q}$. Then, for every $0 < \lambda < 1$,

$$\Pr(|Z| > \lambda\|Z\|_{L_p}) \ge \left[(1 - \lambda^p)c_{p,q}^p\right]^{q/(q-p)},$$

and in particular, $\mathbb{E}|Z| \ge c_1\|Z\|_{L_p}$, where $c_1$ depends only on $p$, $q$ and $c_{p,q}$.

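We will use the lemma for $p = 2$ and $q > 2$. Assume that $(X_i)_{i=1}^n$ are iid
copies of a symmetric, variance 1 random variable and set $a = (X_1,\dots,X_n)$.
If $v,w \in S^{n-1}$, then a straightforward computation shows that

$$\mathbb{E}|\langle a,v\rangle\langle a,w\rangle|^2 = \sum_{i \ne j} v_i^2 w_j^2 + 2\sum_{i \ne j}(v_i w_i)(v_j w_j) + \mathbb{E}X^4\sum_{i=1}^n v_i^2 w_i^2$$

$$= \sum_{i \ne j}\left(v_i^2 w_j^2 + (v_i w_i)(v_j w_j)\right) + \langle v,w\rangle^2 + (\mathbb{E}X^4 - 1)\sum_{i=1}^n v_i^2 w_i^2. \tag{3.5}$$

Using the fact that $\|v\|_2 = \|w\|_2 = 1$, (3.5) reduces to

$$\mathbb{E}|\langle a,v\rangle\langle a,w\rangle|^2 = 1 + 2\langle v,w\rangle^2 - 2\sum_{i=1}^n v_i^2 w_i^2 + (\mathbb{E}X^4 - 1)\sum_{i=1}^n v_i^2 w_i^2. \tag{3.6}$$

Consider two cases. First, if $\sum_{i=1}^n v_i^2 w_i^2 \le 1/10$, and since $\mathbb{E}X^4 \ge
(\mathbb{E}X^2)^2 = 1$, then by (3.6),

$$\mathbb{E}|\langle a,v\rangle\langle a,w\rangle|^2 \ge 1/2.$$

On the other hand, if the reverse inequality holds, then using

$$\sum_{i \ne j}\left(v_i^2 w_j^2 + (v_i w_i)(v_j w_j)\right) + \langle v,w\rangle^2 = \sum_{i > j}(v_i w_j + v_j w_i)^2 + \langle v,w\rangle^2 \ge 0,$$

and applying (3.5),

$$\mathbb{E}|\langle a,v\rangle\langle a,w\rangle|^2 \ge (\mathbb{E}X^4 - 1)\sum_{i=1}^n v_i^2 w_i^2 \ge (\mathbb{E}X^4 - 1)/10.$$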
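Formula (3.6) can be checked exactly in a small case. The sketch below takes the coordinates of $a$ to be independent random signs, for which $\mathbb{E}X^4 = 1$, and compares a brute-force average over all sign patterns with the closed form. The vectors, the dimension and the function names are arbitrary choices made for the illustration.

```python
from itertools import product
import math

def second_moment_of_product(v, w):
    """Exact E(<a,v><a,w>)^2 for a uniform on {-1,1}^n, by enumeration."""
    n = len(v)
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        av = sum(s * v_i for s, v_i in zip(signs, v))
        aw = sum(s * w_i for s, w_i in zip(signs, w))
        total += (av * aw) ** 2
    return total / 2 ** n

def formula_3_6(v, w, fourth_moment):
    """Right-hand side of (3.6) for unit vectors v and w."""
    inner = sum(v_i * w_i for v_i, w_i in zip(v, w))
    overlap = sum((v_i * w_i) ** 2 for v_i, w_i in zip(v, w))
    return 1.0 + 2.0 * inner ** 2 - 2.0 * overlap + (fourth_moment - 1.0) * overlap

def normalize(u):
    """Scale u to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in u))
    return [c / norm for c in u]

v = normalize([1.0, 2.0, -1.0, 0.5])
w = normalize([0.5, -1.0, 3.0, 1.0])
exact = second_moment_of_product(v, w)
predicted = formula_3_6(v, w, fourth_moment=1.0)   # E X^4 = 1 for random signs
```

With $\mathbb{E}X^4 = 1$ the last term of (3.6) vanishes, so only the first case of the argument gives a non-trivial lower bound; this is the situation examined after Corollary 3.7.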
Corollary 3.7 Let $X$ be a symmetric, variance 1 random variable, with a
finite $L_{2q}$ moment for some $q > 2$. If $a = (X_1,\dots,X_n)$ then

$$\inf_{v,w \in S^{n-1}}\kappa(v,w) \ge c(\mathbb{E}X^4 - 1)^{1/2},$$

where $c$ depends on $q$ and on $\|X\|_{L_{2q}}$.

Observe that the two assumptions we make are not very restrictive, since
we assume throughout that $a$ is isotropic and $L$-subgaussian. Hence, if
$a = (X_1,\dots,X_n)$, then $\|X\|_{L_q} \lesssim L\sqrt{q}$ for every $q \ge 2$ (see Section fCT]). Also
note that for any random variable $X$ with variance 1, $\mathbb{E}X^4 \ge (\mathbb{E}X^2)^2 = 1$, so that the
square-root is well defined.

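Proof. Assume that $X \in L_{2q}$ for some $q > 2$. Observe that if $v \in S^{n-1}$
then for every $2 \le r \le 2q$, $\|\langle a,v\rangle\|_{L_r} \le c_r\|X\|_{L_r}$. Indeed, by a Rosenthal
type inequality (see, e.g. [10], Section 1.5),

$$\left\|\sum_{i=1}^n v_i X_i\right\|_{L_r} \le c_r\max\left\{\left(\sum_{i=1}^n v_i^2\mathbb{E}X_i^2\right)^{1/2}, \left(\sum_{i=1}^n |v_i|^r\mathbb{E}|X_i|^r\right)^{1/r}\right\}.$$

Since $\|X\|_{L_r} \ge \|X\|_{L_2}$ and $\|v\|_r \le \|v\|_2 = 1$, the claim follows.
Therefore, $\sup_{v \in S^{n-1}}\|\langle a,v\rangle\|_{L_{2q}} \le c_q\|X\|_{L_{2q}}$, and thus,

$$\|\langle a,v\rangle\langle a,w\rangle\|_{L_q} \le \|\langle a,v\rangle\|_{L_{2q}}\|\langle a,w\rangle\|_{L_{2q}} \le c_q^2\|X\|_{L_{2q}}^2.$$

Let $Z = \langle a,v\rangle\langle a,w\rangle$. Using the notation of Lemma 3.6, $c_{2,q} \gtrsim (\mathbb{E}|X|^4 -
1)^{1/2}/\|X\|_{L_{2q}}^2$. Hence, for every $v,w \in S^{n-1}$, $\mathbb{E}|\langle a,v\rangle\langle a,w\rangle| \ge c(\mathbb{E}|X|^4 -
1)^{1/2}$, where $c$ depends only on $q$ and on $\|X\|_{L_{2q}}$, as claimed. ■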

Observe that if $\mathbb{E}X^4 = 1$, it is possible that $\mathbb{E}|\langle a,v\rangle\langle a,w\rangle| = 0$, even for
$v, w$ of the specific form one would like to control - namely, $v = (s+t)/\|s+t\|_2$
and $w = (s-t)/\|s-t\|_2$ for $s \neq \pm t$. Indeed, let $X$ be a symmetric,
$\{-1,1\}$-valued random variable (and in particular, it is $L$-subgaussian as
well). Let $(e_i)_{i=1}^n$ be the standard basis in $\mathbb{R}^n$ and set $s = e_1$, $t = e_2$.
It is straightforward to verify that in this case, $|\langle a,v\rangle\langle a,w\rangle| = 0$ with
probability 1, and therefore, the assumption on $\mathbb{E}X^4$ cannot be relaxed if
one is interested in a uniform bound.
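The degenerate example can be checked directly. The sketch below (an illustration only; the dimension 3 is an arbitrary choice) enumerates all sign vectors $a \in \{-1,1\}^3$ and evaluates $|\langle a,v\rangle\langle a,w\rangle|$ for $s = e_1$, $t = e_2$.

```python
import itertools
import numpy as np

s = np.array([1.0, 0.0, 0.0])   # e_1
t = np.array([0.0, 1.0, 0.0])   # e_2
v = (s + t) / np.linalg.norm(s + t)
w = (s - t) / np.linalg.norm(s - t)

# <a,v><a,w> = (a_1 + a_2)(a_1 - a_2)/2, and one factor vanishes for every sign pattern
products = [abs((np.array(a) @ v) * (np.array(a) @ w))
            for a in itertools.product((-1.0, 1.0), repeat=3)]
kappa = float(np.mean(products))
```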



3.3 Examples 

Let us turn to a few special cases of Theorem 2.4. To that end, explicit
expressions for $\rho_{T,N}$ are required for the sets of interest.

3.3.1 Entire Space $T = \mathbb{R}^n$

If $T = \mathbb{R}^n$ then $T_+ = T_- = S^{n-1}$, where $S^{n-1}$ is the Euclidean unit sphere
in $\mathbb{R}^n$. Therefore,

$$E = \mathbb{E}\sup_{x\in S^{n-1}}\sum_{i=1}^n g_ix_i = \mathbb{E}\Big(\sum_{i=1}^n g_i^2\Big)^{1/2} \le \sqrt{n},$$

implying that

$$\rho_{T,N} \lesssim \sqrt{\frac{n}{N}} + \frac{n}{N}.$$

Corollary 3.8 For every $L \ge 1$ there are constants $c_1$, $c_2$ and $c_3$ that depend
only on $L$ and for which the following holds. If $\inf_{v,w\in S^{n-1}}\kappa(v,w) \ge \kappa$,
$u \ge c_1$ and $N \ge c_2u^3n/\kappa^2$, then with probability at least $1 - 2\exp(-c_3u^2n)$,
for every $s,t \in \mathbb{R}^n$,

$$\|\phi(As) - \phi(At)\|_1 \ge \frac{\kappa}{2}\,\|s-t\|_2\,\|s+t\|_2.$$

The corollary follows from the fact that with this choice of $N$, $cu^3\rho_{T,N}$ is
proportional to $\kappa/2$.

When $\kappa$ is given by a constant, independent of the dimension $n$, Corollary 3.8
implies that it is sufficient to choose $N \sim n$ to ensure stable recovery
with high probability.



3.3.2 Sparse Vectors 

Let $T = S_k$, the set of $k$-sparse vectors in $\mathbb{R}^n$, put $U_k = \{x \in S^{n-1} : \|x\|_0 \le k\}$
and observe that $T_+, T_- \subset U_{2k}$. Therefore,

$$E \le \mathbb{E}\sup_{x\in U_{2k}}\sum_{i=1}^n g_ix_i = \mathbb{E}\Big(\sum_{i=1}^{2k}(g_i^*)^2\Big)^{1/2},$$

where $(g_i^*)_{i=1}^n$ is a monotone rearrangement of $(|g_i|)_{i=1}^n$. It is standard to
check (see, e.g., [20]) that there is an absolute constant $c$ such that for every
$1 \le k \le n/4$,

$$\mathbb{E}\Big(\sum_{i=1}^{2k}(g_i^*)^2\Big)^{1/2} \le c\sqrt{k\log(en/k)}.$$

Therefore,

$$\rho_{S_k,N} \lesssim \sqrt{\frac{k\log(en/k)}{N}} + \frac{k\log(en/k)}{N}.$$

Corollary 3.9 For every $L \ge 1$ there are constants $c_1$, $c_2$ and $c_3$ that depend
only on $L$ and for which the following holds. If $\inf_{v,w\in U_k}\kappa(v,w) \ge \kappa$,
$u \ge c_1$ and $N \ge c_2u^3k\log(en/k)/\kappa^2$, then with probability at least
$1 - 2\exp(-c_3u^2k\log(en/k))$, for every $s,t \in S_k$,

$$\|\phi(As) - \phi(At)\|_1 \ge \frac{\kappa}{2}\,\|s-t\|_2\,\|s+t\|_2.$$

When $\kappa$ is an absolute constant, Corollary 3.9 implies that it is sufficient to
choose $N \sim k\log(en/k)$ to ensure stable recovery with high probability.
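The order of magnitude of the complexity term for sparse vectors can be explored numerically. The sketch below is a Monte Carlo estimate, not part of the paper's argument. It uses the fact that the supremum of $\sum g_ix_i$ over unit vectors with at most $m$ nonzero coordinates equals the Euclidean norm of the $m$ largest $|g_i|$, and compares the estimate for $m = 2k$ with $\sqrt{k\log(en/k)}$. The values of `n`, `k`, `trials` and the seed are arbitrary assumptions.

```python
import numpy as np

def sparse_gaussian_complexity(n, m, trials=2000, seed=0):
    """Monte Carlo estimate of E (sum of the m largest g_i^2)^(1/2) for g standard Gaussian."""
    rng = np.random.default_rng(seed)
    g2 = rng.standard_normal((trials, n)) ** 2
    # after partitioning at index n - m, the last m entries of each row are the m largest
    top = np.partition(g2, n - m, axis=1)[:, n - m:]
    return float(np.mean(np.sqrt(top.sum(axis=1))))

n, k = 1000, 10
estimate = sparse_gaussian_complexity(n, 2 * k)   # T_+ and T_- are contained in U_{2k}
reference = np.sqrt(k * np.log(np.e * n / k))
ratio = estimate / reference
```

The ratio stays of order one as `n` and `k` vary, which is what the bound $c\sqrt{k\log(en/k)}$ expresses; the simulation says nothing about the value of the absolute constant $c$ beyond the sampled cases.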

3.3.3 Finite Set 

Assume that $T$ is a finite set. Then $T_+, T_- \subset S^{n-1}$ are of cardinality at
most $|T|^2$. A straightforward application of the union bound to each random
variable $\sum_{i=1}^n v_ig_i$ shows that if $V \subset \mathbb{R}^n$ is a finite set, then

$$\mathbb{E}\sup_{v\in V}\sum_{i=1}^n v_ig_i \lesssim \sqrt{\log|V|}\;d(V).$$

Therefore, $E \lesssim \sqrt{\log|T|^2} \sim \sqrt{\log|T|}$, implying that

$$\rho_{T,N} \lesssim \sqrt{\frac{\log|T|}{N}} + \frac{\log|T|}{N}.$$

Corollary 3.10 For every $L \ge 1$ there are constants $c_1$, $c_2$ and $c_3$ that depend
only on $L$ and for which the following holds. If $\inf_{v,w\in T_+}\kappa(v,w) \ge \kappa$,
$u \ge c_1$ and $N \ge c_2u^3\log|T|/\kappa^2$, then with probability at least
$1 - 2\exp(-c_3u^2\log|T|)$, for every $s,t \in T$,

$$\|\phi(As) - \phi(At)\|_1 \ge \frac{\kappa}{2}\,\|s-t\|_2\,\|s+t\|_2.$$

In this case, with constant $\kappa$, $N \sim \log|T|$ measurements ensure stable
recovery with high probability.
3.3.4 Block Sparse Vectors

We next treat the case in which $T$ consists of block sparse vectors of
size $d$ [HI [16]. Let $\{I_\ell\}$, $\ell = 1,\dots,n/d$, be a decomposition of $\{1,\dots,n\}$ into
disjoint blocks of cardinality $d$. Set $W_k$ to be the set of vectors in the unit
sphere, supported on at most $k$ blocks. Then $T_+, T_- \subset W_{2k}$, and it remains
to estimate

$$E = \mathbb{E}\sup_{v\in W_{2k}}\sum_{i=1}^n g_iv_i.$$

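Lemma 3.11 There exist absolute constants $c_1$ and $c_2$ for which the following
holds. For every $0 < \varepsilon < 1/2$,

$$\log N(W_k,\varepsilon) \le c_1\left(k\log(en/(dk)) + dk\log(5/\varepsilon)\right).$$

Therefore,

$$\mathbb{E}\sup_{v\in W_k}\sum_{i=1}^n g_iv_i \le c_2\sqrt{k\log(en/(dk)) + dk}.$$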
Proof. Let $I_J = \{i \in I_j : j \in J\}$ and observe that

$$W_k = \bigcup_{\{J\subset\{1,\dots,n/d\}\,:\,|J|=k\}} S^{I_J},$$

where for every $I \subset \{1,\dots,n\}$, $S^I$ is the Euclidean sphere on the coordinates
$I$. Clearly, there are at most $\binom{n/d}{k}$ such subsets $J$. Using a standard volumetric
estimate (see, e.g., [38], [9]), for every fixed set $J$ and every $\varepsilon < 1/2$,
one needs at most $(5/\varepsilon)^{d|J|} = (5/\varepsilon)^{dk}$ Euclidean balls of radius $\varepsilon$ to cover
$S^{I_J}$. Therefore, for every $0 < \varepsilon < 1/2$,

$$\log N(W_k,\varepsilon B_2^n) \le k\log(en/(dk)) + dk\log(5/\varepsilon),$$

as claimed.

The second part of the claim is an immediate consequence of Proposition 3.3
and the fact that $N(T,\varepsilon)$ is a decreasing function of $\varepsilon$. ■

Corollary 3.12 For every $L \ge 1$ there are constants $c_1$, $c_2$ and $c_3$ that
depend only on $L$ and for which the following holds. If $\inf_{v,w\in W_k}\kappa(v,w) \ge \kappa$,
$u \ge c_1$ and $N \ge c_2u^3(k\log(en/(dk)) + dk)/\kappa^2$, then with probability at least
$1 - 2\exp(-c_3u^2(k\log(en/(dk)) + dk))$, for every $s,t \in T$,

$$\|\phi(As) - \phi(At)\|_1 \ge \frac{\kappa}{2}\,\|s-t\|_2\,\|s+t\|_2.$$

When $\kappa$ is constant we conclude that $N \sim k(\log(en/(kd)) + d)$ measurements
are needed for stability. This result is consistent with that of [16] which
shows that the same value $N$ ensures that a random Gaussian matrix satisfies
the block restricted isometry property.

4 Noisy Measurements 

Next, consider the phase retrieval problem in the presence of noise. The
goal is to find an estimate $\hat{x}$ of the true signal $x_0$ that is close to $x_0$ (or $-x_0$)
in a squared error sense.
Suppose that

$$y_i = |\langle a_i,x_0\rangle|^2 + w_i, \qquad i = 1,\dots,N \tag{4.1}$$

for some $x_0 \in T$. Let $a$ be an isotropic, $L$-subgaussian random vector and
assume that the noise $w$ is independent of $a$, symmetric, and of reasonable
decay properties, which will be specified in Assumption 4.1 below.

Question 4.1 Given $(a_i,y_i)_{i=1}^N$, combined with the information that the
noisy data $y_i$ is generated by a point $x_0 \in T$ via (4.1), is it possible to
produce an estimate $\hat{x} \in T$ for which $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2$ is small?

Note that the error is measured by the product $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2$, since it
is impossible to distinguish between $x_0$ and $-x_0$.

The answer to this question is affirmative, as shown in Theorem 4.8.
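To make the measurement model (4.1) and the sign-invariant error concrete, here is a minimal simulation sketch. The Gaussian choice for the measurement vectors and the noise, the dimensions and the function names are assumptions made for illustration; the paper only requires $a$ to be isotropic and $L$-subgaussian.

```python
import numpy as np

def quadratic_measurements(A, x0, noise):
    """y_i = |<a_i, x0>|^2 + w_i as in (4.1); the rows of A are the vectors a_i."""
    return (A @ x0) ** 2 + noise

def recovery_error(x_hat, x0):
    """The product ||x_hat - x0||_2 * ||x_hat + x0||_2, which ignores the global sign."""
    return float(np.linalg.norm(x_hat - x0) * np.linalg.norm(x_hat + x0))

rng = np.random.default_rng(1)          # arbitrary seed
N, n = 50, 8
A = rng.standard_normal((N, n))         # assumed Gaussian measurement vectors
x0 = rng.standard_normal(n)
w = 0.1 * rng.standard_normal(N)        # assumed Gaussian noise
y = quadratic_measurements(A, x0, w)
y_flipped = quadratic_measurements(A, -x0, w)
```

The vectors `y` and `y_flipped` coincide, so no procedure based on the data alone can tell $x_0$ from $-x_0$, and `recovery_error` assigns zero error to both.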

4.1 Preliminaries: $\psi_\alpha$ Random Variables

Throughout our analysis we assume that the noise $w$ decays properly. In
order to quantify this decay we rely on the notion of $\psi_\alpha$ random variables,
which are defined below (see [28], [48], [9] as general references for properties
of $\psi_\alpha$ random variables).

Definition 4.2 Let $X$ be a random variable. For $1 \le \alpha \le 2$ let

$$\|X\|_{\psi_\alpha} = \inf\left\{C > 0 : \mathbb{E}\exp(|X/C|^\alpha) \le 2\right\},$$

and denote by $L_{\psi_\alpha}$ the set of random variables for which $\|X\|_{\psi_\alpha} < \infty$.
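Definition 4.2 can be evaluated numerically for a finite sample, since the map $C \mapsto \mathbb{E}\exp(|X/C|^\alpha)$ is decreasing. The sketch below applies bisection to the empirical distribution of the sample; the function name `psi_norm` and the tolerance are illustrative assumptions. For heavy-tailed data the empirical average can be a poor proxy for the expectation, so this is only an illustration of the definition.

```python
import numpy as np

def psi_norm(samples, alpha, tol=1e-10):
    """Smallest C with mean(exp(|x/C|^alpha)) <= 2 over the given samples (bisection)."""
    x = np.abs(np.asarray(samples, dtype=float))

    def ok(C):
        with np.errstate(over="ignore"):      # overflow to inf just means "C too small"
            return np.mean(np.exp((x / C) ** alpha)) <= 2.0

    lo, hi = 0.0, max(float(x.max()), 1e-12)
    while not ok(hi):                         # enlarge hi until the constraint holds
        hi *= 2.0
    while hi - lo > tol:                      # the constraint is monotone in C
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi

signs = np.array([-1.0, 1.0, 1.0, -1.0])      # |X| = 1, so E exp(|X/C|^alpha) = exp(C^(-alpha))
norm_psi2 = psi_norm(signs, alpha=2.0)
norm_psi1 = psi_norm(signs, alpha=1.0)
```

For a $\{-1,1\}$-valued variable the definition gives $\|X\|_{\psi_\alpha} = (\log 2)^{-1/\alpha}$ in closed form, which is what the two computed values approximate.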

The $\psi_\alpha$ norm can be characterized using information on the tail of $X$. Indeed,
there exists an absolute constant $c$, for which, if $t \ge 1$, then
$\Pr(|X| \ge t) \le 2\exp(-ct^\alpha/\|X\|_{\psi_\alpha}^\alpha)$. The reverse direction is also true, that is, if
$\Pr(|X| \ge t) \le 2\exp(-t^\alpha/A^\alpha)$, then $\|X\|_{\psi_\alpha} \le c_1A$ for an absolute constant $c_1$.
It is well known that $\|\cdot\|_{\psi_\alpha}$ is a norm on $L_{\psi_\alpha}$, and that

$$\|X\|_{\psi_\alpha} \sim \sup_{p\ge1}\frac{\|X\|_{L_p}}{p^{1/\alpha}}.$$

In other words,

$$\|X\|_{L_p} \lesssim \|X\|_{\psi_\alpha}\,p^{1/\alpha}, \qquad \forall p \ge 1. \tag{4.2}$$

In the language of the previous section, $X$ is $L$-subgaussian if and only if
$\|X\|_{\psi_2} \lesssim L\|X\|_{L_2}$. Since the $\psi_\alpha$ norms have a natural hierarchy, it follows
that if $X$ is $L$-subgaussian then

$$\|X\|_{L_2} \lesssim \|X\|_{\psi_1} \lesssim \|X\|_{\psi_2} \le cL\|X\|_{L_2}.$$

Therefore, if $X$ is $L$-subgaussian and mean-zero then $\|X\|_{\psi_2} \sim_L \sigma_X$, where
$\sigma_X$ is the standard deviation of $X$.

A straightforward application of the tail behavior of a $\psi_\alpha$ random variable
implies that if $X_1,\dots,X_N$ are independent copies of $X$ and $t \ge 1$, then

$$\Pr\Big(\max_{i\le N}|X_i| \ge t\log^{1/\alpha}N\,\|X\|_{\psi_\alpha}\Big) \le 2\exp(-c_2t^{1/\alpha});$$

hence,

$$\Big\|\max_{1\le i\le N}|X_i|\Big\|_{\psi_\alpha} \le c_3\|X\|_{\psi_\alpha}\log^{1/\alpha}N. \tag{4.3}$$



From the definition of the $\psi_\alpha$ norm it is evident that if $\alpha = \beta/q$ then

$$\big\||X|^q\big\|_{\psi_\alpha} = \|X\|_{\psi_\beta}^q, \tag{4.4}$$

and in particular, $X \in L_{\psi_\beta}$ for $\beta \ge 1$ if and only if $|X|^\beta \in L_{\psi_1}$.

Although there are versions of the following theorem (and of Definition 4.2)
for any $0 < \alpha$, for the sake of simplicity, we shall restrict ourselves to
the case $\alpha = 1$, which is the setting needed in the proofs below.

Theorem 4.3 There exists an absolute constant $c_1$ for which the following
holds. If $X \in L_{\psi_1}$ and $X_1,\dots,X_N$ are independent copies of $X$, then for
every $t > 0$,

$$\Pr\Big(\Big|\frac{1}{N}\sum_{i=1}^N X_i - \mathbb{E}X\Big| > t\|X\|_{\psi_1}\Big) \le 2\exp(-c_1N\min\{t^2,t\}).$$

Combining Theorem 4.3 and (4.4) leads to the following corollary:

Corollary 4.4 Let $p \ge 1$ and assume that $w$ is a random variable for which
$|w|^p \in L_{\psi_1}$ (or $w \in L_{\psi_p}$). Then, with probability at least $1 - 2\exp(-ct)$,

$$\Big|\frac{1}{N}\sum_{i=1}^N|w_i|^p - \mathbb{E}|w|^p\Big| \le \sqrt{\frac{t}{N}}\,\|w\|_{\psi_p}^p.$$

The corollary follows immediately from Theorem 4.3 by taking $t' = \sqrt{t/N}$
for $0 < t < N$, and since $\||w|^p\|_{\psi_1} = \|w\|_{\psi_p}^p$.

We will also be interested in decay properties of the random variable
$\sup_{t\in T}|\langle X,t\rangle|$ for a set $T \subset \mathbb{R}^n$. If $\mu$ is an isotropic, $L$-subgaussian measure
on $\mathbb{R}^n$, one has the following.

Theorem 4.5 For every $L \ge 1$ there exist constants $c_1$, $c_2$, $c_3$ and $c_4$ that
depend only on $L$ and for which the following holds. If $u \ge c_1$, then with
probability at least $1 - 2\exp(-c_2u\log N)$,

$$\max_{1\le i\le N}\sup_{t\in T}|\langle a_i,t\rangle|^2 \le c_3u\left(\ell^2(T) + d^2(T)\log N\right),$$

where $\ell(T)$ and $d(T)$ are defined by (2.9) and (2.10). In particular,

$$\Big\|\max_{1\le i\le N}\sup_{t\in T}|\langle a_i,t\rangle|^2\Big\|_{\psi_1} \le c_4\left(\ell^2(T) + d^2(T)\log N\right).$$

4.2 The Recovery Algorithm 

The assumptions we make throughout this section are as follows: 

Assumption 4.1 Assume that $a$ is isotropic and subgaussian, and that the
noise $w$ in (4.1) is a symmetric, $\psi_2$ random variable that is independent of
$a$.

Recall that the goal is to find an estimate $\hat{x}$ of $x_0$ that is close to $x_0$ or
to $-x_0$. Given the measurements $(y_i)_{i=1}^N$, a reasonable approach is to seek
a value of $x$ that minimizes the empirical risk function:

$$P_N\ell_x = \frac{1}{N}\sum_{i=1}^N\left||\langle a_i,x\rangle|^2 - y_i\right|^p \tag{4.5}$$

for some $p$. Here we will consider values of $p$ in the regime $1 < p \le 2$; the
exact choice of $p$ will become clear later on. Note that for every $x \in T$,

$$\ell_x = \left||\langle a,x\rangle|^2 - y\right|^p = \left|\langle a,x-x_0\rangle\langle a,x+x_0\rangle - w\right|^p. \tag{4.6}$$
Since the empirical average $P_N\ell_x$ is not a convex function in $x$, it is
impossible in general to find a value of $x$ that minimizes it. Luckily, for our
purposes, one does not need an exact minimizer. Instead, in order to bound
the estimation error, it is sufficient to find a value of $x$ for which the
empirical risk is bounded above, as incorporated in Definition 4.6 below.
To this end, one may use any algorithm for phase minimization and check
whether the resulting solution satisfies the bound. Particularly useful in
this context are techniques that depend on the initial starting point; such
methods can be started from several different points, and in that way, if a
particular solution does not satisfy the bound then the algorithm may be
used again, but from a different starting point. Eventually, with high
probability, a point satisfying the bound will be obtained. One algorithm of this
form is the GESPAR method developed in [42].
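The restart strategy described above can be sketched as a loop around a local solver. The code below is not GESPAR and is not the paper's procedure: `descend` is a hypothetical stand-in (plain gradient descent on the $p = 2$ risk with a fixed step), `threshold` is supplied by the caller in place of the bound of Definition 4.6, and the toy instance at the end is a one-dimensional assumption chosen so that one starting point is stuck at a stationary point while another succeeds.

```python
import numpy as np

def empirical_risk(x, A, y, p):
    """P_N l_x = (1/N) sum_i | |<a_i, x>|^2 - y_i |^p, as in (4.5)."""
    return float(np.mean(np.abs((A @ x) ** 2 - y) ** p))

def descend(x, A, y, step=0.01, iters=500):
    """Hypothetical local solver: gradient descent on the p = 2 risk with a fixed step."""
    for _ in range(iters):
        r = (A @ x) ** 2 - y                        # residuals |<a_i,x>|^2 - y_i
        x = x - step * (4.0 / len(y)) * (A.T @ (r * (A @ x)))
    return x

def find_good_estimate(starts, A, y, p, threshold):
    """Run the local solver from each start; return the first point whose risk passes."""
    for x_init in starts:
        x = descend(np.array(x_init, dtype=float), A, y)
        if empirical_risk(x, A, y, p) <= threshold:
            return x
    return None                                     # no start produced a feasible point

# Toy noise-free instance: a_i = 1, x0 = 2, so y_i = 4 and the p = 2 risk is (x^2 - 4)^2.
A = np.ones((10, 1))
y = (A @ np.array([2.0])) ** 2
estimate = find_good_estimate([[0.0], [0.5]], A, y, p=1.5, threshold=1e-6)
```

Starting from 0 the gradient vanishes and the iterate never moves, so the check fails and the second start is used; this is the situation in which restarting from a different point pays off.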



Definition 4.6 Let $1 < p \le 2$ be given, and choose a value of $u \ge 1$. Given
the data $(a_i,y_i)_{i=1}^N$, $\hat{x} \in T$ is called a good estimate if it satisfies that

$$\frac{1}{N}\sum_{i=1}^N\left||\langle a_i,\hat{x}\rangle|^2 - y_i\right|^p \le \mathbb{E}|w|^p + u\Big(Q_{T,N,w} - \frac{\||w|^p\|_{\psi_1}}{\sqrt{N}}\Big), \tag{4.7}$$

where

$$Q_{T,N} = d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N}, \qquad Q_{T,N,w} = Q_{T,N} + \frac{\||w|^p\|_{\psi_1}}{\sqrt{N}}, \tag{4.8}$$

and $\ell(T)$, $d(T)$ are defined by (2.9), (2.10).

To motivate the choice of $\hat{x}$ in Definition 4.6, observe that $Q_{T,N,w}$
captures the "statistical complexity" of the problem - namely, the sum
of the "gaussian complexity" of $T$, $Q_{T,N}$, and the influence of the noise,
$\||w|^p\|_{\psi_1}/\sqrt{N}$. The parameter $u$ tunes the probability estimate, and for the
moment is of secondary importance. The exact choice of $p$ and $u$ will be
specified in Theorem 4.8.

This approach is based on a modified empirical risk minimization - modified
in two ways. First, instead of minimizing the loss functional $P_N\ell_x$, the
search is for an empirical feasible point; some $\hat{x} \in T$ for which

$$\frac{1}{N}\sum_{i=1}^N\left||\langle a_i,\hat{x}\rangle|^2 - y_i\right|^p - \frac{1}{N}\sum_{i=1}^N|w_i|^p \le uQ_{T,N,w}. \tag{4.9}$$

Observe that the value on the left hand side of (4.9) is the empirical excess
risk $P_N\mathcal{L}_{\hat{x}}$, where

$$\mathcal{L}_x = \ell_x - \ell_{x_0} = \left|\langle a,x-x_0\rangle\langle a,x+x_0\rangle - w\right|^p - |w|^p \tag{4.10}$$

is the excess loss functional. The definition implies that the empirical excess
risk at $\hat{x}$ is of the same order of magnitude as the "statistical error" and thus

$$P_N\mathcal{L}_{\hat{x}} \le uQ_{T,N,w}.$$

One of the key components of the proof is to show that if $P_N\mathcal{L}_{\hat{x}}$ is small,
then so is the conditional expectation, $\mathbb{E}\mathcal{L}_{\hat{x}}$. The second key component is
that if $\mathbb{E}\mathcal{L}_{\hat{x}}$ is small, then so is $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2$.

Unfortunately, it is impossible to estimate the empirical excess risk since
one does not have access to the sampled noise $w_1,\dots,w_N$, and therefore,
nor to $\frac{1}{N}\sum_{i=1}^N|w_i|^p$, which is the reason for the second modification. By
Assumption 4.1, $w \in L_{\psi_2}$ and consequently $|w|^p \in L_{\psi_1}$. From Corollary 4.4,
if $u < N$, then with probability at least $1 - 2\exp(-c_1u^2)$,

$$\frac{1}{N}\sum_{i=1}^N|w_i|^p \ge \mathbb{E}|w|^p - u\,\frac{\||w|^p\|_{\psi_1}}{\sqrt{N}}.$$

Therefore, if $\hat{x}$ satisfies (4.7), then it also satisfies (4.9), meaning that its
empirical excess risk is bounded above by the desired quantity. This leads
to the following proposition.

Proposition 4.7 There exists an absolute constant $c_1$ for which the following
holds. Let $\hat{x}$ be a point that satisfies (4.7) and let $w \in L_{\psi_2}$. If $0 < u < N$,
then with probability at least $1 - 2\exp(-c_1u^2)$, $P_N\mathcal{L}_{\hat{x}} \le uQ_{T,N,w}$.

To see that there is always a point $\hat{x}$ that satisfies (4.7), observe that
for $x_0$ and $0 < u < N$, with probability at least $1 - 2\exp(-cu^2)$ (see Corollary 4.4),

$$\frac{1}{N}\sum_{i=1}^N\left||\langle a_i,x_0\rangle|^2 - y_i\right|^p = \frac{1}{N}\sum_{i=1}^N|w_i|^p \le \mathbb{E}|w|^p + u\,\frac{\||w|^p\|_{\psi_1}}{\sqrt{N}} \le \mathbb{E}|w|^p + uQ_{T,N,w}.$$



Moreover, unless $T$ is very small and $w$ is very large, $Q_{T,N}$ is the dominant
term in $Q_{T,N,w}$. For example, consider the case in which $w$ is a centered
Gaussian with variance $\sigma^2$ and $T$ is the set of $k$-sparse vectors on the unit
sphere. Then, $\||w|^p\|_{\psi_1}/\sqrt{N} \sim \sigma^p/\sqrt{N}$, while $Q_{T,N} \sim \sqrt{k\log(en/k)}/\sqrt{N}$,
which clearly is larger than $\||w|^p\|_{\psi_1}/\sqrt{N}$, as long as $k$ is large relative to $\sigma$.

We are now ready to state our main result. To this end recall the definition
of $\kappa(s,t)$ given by (2.6), and let $\kappa_T = \inf_{s,t\in T}\kappa(s,t)$.
Theorem 4.8 For every $\kappa > 0$ and every $L \ge 1$ there exist constants
$c_1$, $c_2$, $c_3$ and $c_4$ that depend only on $L$ and $\kappa$, for which the following holds.
Let $a$ be distributed according to an isotropic, $L$-subgaussian measure, and
assume that $\kappa_T \ge \kappa$. Assume further that $\|w\|_{\psi_2} < \infty$. For every integer $N$
set

$$\beta_N = \max\left\{c_1\left((\|w\|_{\psi_2} + d^2(T))\log N + \ell^2(T)\right),\ e\right\}$$

and

$$p = 1 + 1/\log\beta_N.$$

Let $\hat{x}$ be chosen to satisfy (4.7). Then, for $u \ge c_2$, with probability at least
$1 - 2\exp(-c_3u^{1/3})$,

$$\|x_0 - \hat{x}\|_2\,\|x_0 + \hat{x}\|_2 \le c_4\,(uQ_{T,N,w})^{1/p}\sqrt{\log\beta_N},$$

where $Q_{T,N,w}$ is defined by (4.8).

Note that $\|w\|_{\psi_2} < \infty$ implies that $\||w|^p\|_{\psi_1} < \infty$ for any $p \le 2$.
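The role of the exponent $p = 1 + 1/\log\beta_N$ is easiest to see through the identity $\beta_N^{p-1} = \beta_N^{1/\log\beta_N} = e$: raising a quantity of size $\beta_N$ to the power $p - 1$ leaves an absolute constant. The sketch below computes $\beta_N$ and $p$ for a sparse example. The constant `c1 = 1.0` is a placeholder (the theorem only asserts that a suitable $c_1$ exists), and the numerical values of `n`, `k`, `N` and `sigma` are arbitrary assumptions.

```python
import math

def beta_and_p(psi2_noise, d_T, ell_T, N, c1=1.0):
    """beta_N and p = 1 + 1/log(beta_N) as in Theorem 4.8; c1 = 1.0 is a placeholder."""
    beta = max(c1 * ((psi2_noise + d_T ** 2) * math.log(N) + ell_T ** 2), math.e)
    return beta, 1.0 + 1.0 / math.log(beta)

n, k, N, sigma = 10_000, 20, 5_000, 0.5
ell = math.sqrt(k * math.log(math.e * n / k))   # l(U_k), up to an absolute constant
beta, p = beta_and_p(sigma, 1.0, ell, N)        # d(T) = 1 for subsets of the sphere
collapsed = beta ** (p - 1.0)                   # equals e up to rounding
```

Because $\beta_N \ge e$ by definition, the resulting $p$ always lies in $(1, 2]$, the range required in Definition 4.6.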

Since $Q_{T,N,w}$ decays as $1/\sqrt{N}$ while $\beta_N$ grows as $\log N$, it is always possible
to choose $N$ large enough so that the error given in the theorem is made
sufficiently small. As an example, consider the case of $k$-sparse vectors on
the sphere. Hence, $d(T) = 1$, and recall from Section 3.3 that
$\ell(U_k) \sim (k\log(en/k))^{1/2}$. If $w$ is $L$-subgaussian then $\|w\|_{\psi_2} \lesssim_L \sigma$, where $\sigma$ is the
noise standard deviation, and

$$\beta_N \sim_L (\sigma + 1)\log N + k\log(en/k).$$

If $k\log(en/k) \ge (\sigma + 1)\log N$ (which is the reasonable range, as one expects
$N \sim k$ up to logarithmic factors), then $\beta_N \sim k\log(en/k)$, and by
Theorem 4.8,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa} \sqrt{\log\beta_N}\,\Big(\frac{k\log(en/k)}{N}\Big)^{1/2 - c/\log\beta_N}, \tag{4.11}$$

where $c$ is a constant.

To proceed, and as will be noted in Section 5, in the case of linear
measurements, with high probability,

$$\|\hat{x} - x_0\|_2^2 \lesssim \Big(\frac{k\log(en/k)}{N}\Big)^{1/2}.$$

To compare the "quadratic" estimate with the linear one, note that if
$N \le (k\log(en/k))^\gamma$ for $\gamma \ge c_1$ and some constant $c_1 > 1$, then recalling that for
every $x$, $x^{1/\log x} \le e$, it is evident that

$$\Big(\frac{k\log(en/k)}{N}\Big)^{-c/\log\beta_N} \le (k\log(en/k))^{c(\gamma-1)/\log\beta_N} \sim \beta_N^{c(\gamma-1)/\log\beta_N} \le C^{\gamma-1},$$

where $C$ is an absolute constant. Therefore, with this choice of $N$,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa,\gamma} \sqrt{\log(k\log(en/k))}\,\Big(\frac{k\log(en/k)}{N}\Big)^{1/2},$$

and up to logarithmic factors scales as the estimate in the linear case.

Clearly, it suffices to take $N \gtrsim_{L,\gamma,\varepsilon} k\log(en/k)\log k$ to ensure that
$\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2 \le \varepsilon$, which is off only by a $\log k$ factor from the optimal
estimate in the linear case.

Corollary 4.9 For every $L \ge 1$ and $\kappa > 0$ there exist constants $c_1$, $c_2$, $c_3$
that depend only on $L$ and $\kappa$ and for which the following holds. Let $T$ be the
set of $k$-sparse vectors on the sphere, set $a$ to be distributed according to an
isotropic, $L$-subgaussian measure and assume that $\kappa_T \ge \kappa$. If the noise $w$
is $L$-subgaussian, $N \le (k\log(en/k))^\gamma$ for $\gamma \ge c_1 > 1$ and $u \ge c_2$, then with
probability at least $1 - 2\exp(-c_3u^{1/3})$,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa,\gamma,u} \sqrt{\log(k\log(en/k))}\,\Big(\frac{k\log(en/k)}{N}\Big)^{1/2}.$$

In particular, if $N \gtrsim_{L,\gamma,\varepsilon,\delta} k\log(en/k)\log k$ then $\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2 \le \varepsilon$
with probability at least $1 - \delta$.

4.3 Proof of Theorem 4.8



The proof of the theorem requires several preliminary facts about empirical 
and Bernoulli processes. We refer the reader to |28l |2"T] for more details on 
these processes. 

Throughout this section, $(\Omega,\mu)$ is a probability space and $(X_i)_{i=1}^N$ are
iid, distributed according to $\mu$. Let $\varepsilon_1,\dots,\varepsilon_N$ be independent, symmetric,
$\{-1,1\}$-valued random variables, that are independent of $X_1,\dots,X_N$.

The first result we require is the contraction inequality for Bernoulli 
processes. 

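Theorem 4.10 [28] Let $F : \mathbb{R}_+ \to \mathbb{R}_+$ be convex and increasing and let
$\phi_i : \mathbb{R} \to \mathbb{R}$ satisfy that $\phi_i(0) = 0$ and $\max_{1\le i\le N}\|\phi_i\|_{\mathrm{lip}} \le \beta$, where $\|\phi_i\|_{\mathrm{lip}}$
is the Lipschitz constant of $\phi_i$. Then, for any bounded $T \subset \mathbb{R}^N$,

$$\mathbb{E}F\Big(\frac{1}{2\beta}\sup_{t\in T}\Big|\sum_{i=1}^N\varepsilon_i\phi_i(t_i)\Big|\Big) \le \mathbb{E}F\Big(\sup_{t\in T}\Big|\sum_{i=1}^N\varepsilon_it_i\Big|\Big).$$

The following symmetrization argument allows one to bound an empirical
process using the Bernoulli process indexed by the random set
$\{(h(X_i))_{i=1}^N : h \in H\}$.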



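Theorem 4.11 Jftfl Let $F : \mathbb{R}_+ \to \mathbb{R}_+$ be convex and increasing and let $H$
be a class of functions. Then

$$\mathbb{E}F\Big(\sup_{h\in H}\Big|\frac{1}{N}\sum_{i=1}^N h(X_i) - \mathbb{E}h\Big|\Big) \le \mathbb{E}F\Big(2\sup_{h\in H}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_ih(X_i)\Big|\Big) \le \mathbb{E}F\Big(4\sup_{h\in H}\Big|\frac{1}{N}\sum_{i=1}^N h(X_i) - \mathbb{E}h\Big|\Big).$$

We will use Theorem 4.10 and Theorem 4.11 with $F(x) = |x|^q$ for $q \ge 2$.

The final result we require is the Kahane-Khintchine inequality [28], on
the moments of Bernoulli processes.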

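Theorem 4.12 There exists an absolute constant $c$ for which the following
holds. If $T \subset \mathbb{R}^N$ and $q \ge 2$ then

$$\Big(\mathbb{E}\sup_{t\in T}\Big|\sum_{i=1}^N\varepsilon_it_i\Big|^q\Big)^{1/q} \le c\sqrt{q}\;\mathbb{E}\sup_{t\in T}\Big|\sum_{i=1}^N\varepsilon_it_i\Big| \le c\sqrt{q}\,\Big(\mathbb{E}\sup_{t\in T}\Big|\sum_{i=1}^N\varepsilon_it_i\Big|^2\Big)^{1/2}.$$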

The first step in the proof of Theorem 4.8 is to obtain an oracle inequality
for $\mathbb{E}\mathcal{L}_x$ that holds for any $x \in T$ (see Lemma 4.13 below). The oracle
inequality is used for $x = \hat{x}$, and noting that for a good $\hat{x}$, $P_N\mathcal{L}_{\hat{x}}$ is bounded
above by $uQ_{T,N,w}$ leads to an upper bound of the form $\mathbb{E}\mathcal{L}_{\hat{x}} \lesssim uQ_{T,N,w}$.
The second part of the proof consists of establishing a lower bound on $\mathbb{E}\mathcal{L}_{\hat{x}}$
which is a function of $\|x_0 - \hat{x}\|_2\|x_0 + \hat{x}\|_2$.

Lemma 4.13 For every $L \ge 1$ there exist constants $c_1$, $c_2$ and $c_3$ that depend
only on $L$ for which the following holds. If $p$ is chosen as in Theorem 4.8
then for $u \ge c_1$, with probability at least $1 - 2\exp(-c_2u^{1/3})$, for every
$x \in T$,

$$\mathbb{E}\mathcal{L}_x \le P_N\mathcal{L}_x + c_3u\Big(d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N}\Big).$$

The choice made above, of $p = 1 + 1/\log\beta_N$, is the key point in the proof,
and it is there to balance two issues. On one hand, if $p$ is larger than 1, then
the empirical process $x \to P_N\mathcal{L}_x - \mathbb{E}\mathcal{L}_x$ becomes much harder to control. On
the other, if $p = 1$, then the loss is not strictly convex, and it is impossible
to lower bound $\mathbb{E}\mathcal{L}_x$ using a function of $\|x - x_0\|_2\|x + x_0\|_2$. This choice of $p$
is sufficiently close to 1 to enable control of the empirical process (the main
point is (4.12)), while it is far enough from 1 to give enough convexity to
enable the lower bound.

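Proof. Fix $q \ge 2$. By the symmetrization theorem (Theorem 4.11) and the
independence of $a$ and $w$,

$$\mathbb{E}\sup_{x\in T}|P_N\mathcal{L}_x - \mathbb{E}\mathcal{L}_x|^q \le 2^q\,\mathbb{E}_{a\times w}\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\left(|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle - w_i|^p - |w_i|^p\right)\Big|^q.$$

Let

$$D_{\infty,N} = 2\max_{1\le i\le N}\Big(|w_i| + \sup_{x\in T}|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle|\Big)$$

and observe that for every realization of $(w_i)_{i=1}^N$, the functions
$y \to |y - w_i|^p - |w_i|^p$ vanish at 0 and are Lipschitz on $[-b,b]$ with a constant
$p(b + |w_i|)^{p-1}$. For $b \le \max_{1\le i\le N}\sup_{x\in T}|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle|$ this constant
is proportional to $D_{\infty,N}^{p-1}$, since $p \le 2$. Applying the contraction inequality
(Theorem 4.10), conditioned on $w_1,\dots,w_N$ and $a_1,\dots,a_N$,

$$\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\left(|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle - w_i|^p - |w_i|^p\right)\Big|^q \le (cD_{\infty,N}^{p-1})^q\,\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle\Big|^q.$$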

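By the Kahane-Khintchine inequality, the Cauchy-Schwarz inequality,
and Jensen's inequality combined with reverse symmetrization (the other
direction of Theorem 4.11),

$$\begin{aligned}
&\mathbb{E}_{a\times w}\Big((D_{\infty,N}^{p-1})^q\,\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle\Big|^q\Big) \\
&\quad\le (c_1\sqrt{q})^q\,\mathbb{E}_{a\times w}\Big((D_{\infty,N}^{p-1})^q\Big(\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle\Big|\Big)^q\Big) \\
&\quad\le (c_1\sqrt{q})^q\,\|D_{\infty,N}^{p-1}\|_{L_{2q}}^q\Big(\mathbb{E}_{a\times w}\Big(\mathbb{E}_\varepsilon\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\varepsilon_i\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle\Big|\Big)^{2q}\Big)^{1/2} \\
&\quad\le (c_2\sqrt{q})^q\,\|D_{\infty,N}^{p-1}\|_{L_{2q}}^q\,\Big\|\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle - \mathbb{E}\langle a,x-x_0\rangle\langle a,x+x_0\rangle\Big|\Big\|_{L_{2q}}^q,
\end{aligned}$$

where $\|\cdot\|_{L_{2q}}$ is taken with respect to the $N$-product measure $(a\otimes w)^N$.
Setting

$$B_{T,N,q} = \Big\|\sup_{x\in T}\Big|\frac{1}{N}\sum_{i=1}^N\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle - \mathbb{E}\langle a,x-x_0\rangle\langle a,x+x_0\rangle\Big|\Big\|_{L_{2q}},$$

it follows that

$$\Big\|\sup_{x\in T}|P_N\mathcal{L}_x - \mathbb{E}\mathcal{L}_x|\Big\|_{L_q} \le c_2\sqrt{q}\,\|D_{\infty,N}^{p-1}\|_{L_{2q}}\,B_{T,N,q},$$

and it remains to bound $\|D_{\infty,N}^{p-1}\|_{L_{2q}}$ and $B_{T,N,q}$.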

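To estimate $B_{T,N,q}$, set $\tilde{T} = \{t - s : s,t \in T\cup(-T)\}$. Note that
$\ell(\tilde{T}) \le 2\ell(T)$ and that $d(\tilde{T}) \le 2d(T)$. Applying Theorem 2.8,

$$B_{T,N,q} \le cq^{3/2}\left(d(T)\ell(T)/\sqrt{N} + \ell^2(T)/N\right).$$

Turning to $\|D_{\infty,N}^{p-1}\|_{L_{2q}}$, observe that pointwise

$$\max_{1\le i\le N}\sup_{x\in T}|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle| \le \max_{1\le i\le N}\sup_{x\in\tilde{T}}|\langle a_i,x\rangle|^2.$$

Therefore, by Theorem 4.5,

$$\Big\|\max_{1\le i\le N}\sup_{x\in T}|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle|\Big\|_{\psi_1} \lesssim_L \ell^2(T) + d^2(T)\log N.$$

Hence, by (4.3),

$$\begin{aligned}
\|D_{\infty,N}\|_{\psi_1} &\le 2\Big(\Big\|\max_{1\le i\le N}|w_i|\Big\|_{\psi_1} + \Big\|\max_{1\le i\le N}\sup_{x\in T}|\langle a_i,x-x_0\rangle\langle a_i,x+x_0\rangle|\Big\|_{\psi_1}\Big) \\
&\le c_L\left((\|w\|_{\psi_2} + d^2(T))\log N + \ell^2(T)\right) \le \beta_N.
\end{aligned}$$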

Set $p = 1 + 1/\log\beta_N$. With this choice, combined with the moment
characterization of the $\psi_1$ norm (4.2), it is evident that

$$\|D_{\infty,N}^{p-1}\|_{L_{2q}} \le c_5q^{p-1} \tag{4.12}$$

for a suitable absolute constant $c_5$. Indeed,

$$\left(\mathbb{E}D_{\infty,N}^{(p-1)2q}\right)^{1/((p-1)2q)} \lesssim (p-1)q\,\|D_{\infty,N}\|_{\psi_1},$$

and thus,

$$\|D_{\infty,N}^{p-1}\|_{L_{2q}} \lesssim ((p-1)q)^{p-1}\,\|D_{\infty,N}\|_{\psi_1}^{p-1} \lesssim ((p-1)q)^{p-1}\beta_N^{p-1} \le c_5q^{p-1}.$$

With these two estimates, it is evident that there exists a constant $c_6$
that depends only on $L$ for which, for every $q \ge 2$,

$$\Big\|\sup_{x\in T}|P_N\mathcal{L}_x - \mathbb{E}\mathcal{L}_x|\Big\|_{L_q} \le c_6\,q^{1/2 + 3/2 + (p-1)}\Big(d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N}\Big).$$

With this $L_q$ estimate at hand, it is standard to show (see, e.g., [9] for a
similar argument), that for $u \ge 1$, with probability at least $1 - 2\exp(-c_7u^{1/3})$,

$$\sup_{x\in T}|P_N\mathcal{L}_x - \mathbb{E}\mathcal{L}_x| \le c_8u\Big(d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N}\Big),$$

where $c_7$ and $c_8$ depend only on $L$.

Finally, in this case, for every $x \in T$,

$$\mathbb{E}\mathcal{L}_x \le P_N\mathcal{L}_x + c_8u\Big(d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N}\Big),$$

as claimed. ■

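With Lemma 4.13 in mind, the choice of $\hat{x}$ becomes clearer. One would
like to find any point in $T$ for which $P_N\mathcal{L}_x$ is, at most, of the same order of
magnitude as the combined complexity term of the set $T$ and the noise

$$Q_{T,N,w} = d(T)\frac{\ell(T)}{\sqrt{N}} + \frac{\ell^2(T)}{N} + \frac{\||w|^p\|_{\psi_1}}{\sqrt{N}}.$$

In this case, $\mathbb{E}\mathcal{L}_{\hat{x}}$ can be bounded above by $Q_{T,N,w}$. It follows from
Proposition 4.7 and Lemma 4.13 that with probability at least
$1 - 2\exp(-cu^{1/3}) - 2\exp(-cu^2)$, both

$$P_N\mathcal{L}_{\hat{x}} \le uQ_{T,N,w}$$

and

$$\mathbb{E}\mathcal{L}_{\hat{x}} \le P_N\mathcal{L}_{\hat{x}} + c_1uQ_{T,N,w}.$$

Therefore,

$$\mathbb{E}\mathcal{L}_{\hat{x}} \le c_2uQ_{T,N,w}. \tag{4.13}$$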

To complete the proof of Theorem 4.8 one has to bound $\mathbb{E}\mathcal{L}_{\hat{x}}$ from below.
The fact that $p > 1$ gives "enough convexity" to establish the desired lower
bound.

Given $x \in T$ set $h_x = \langle a,x-x_0\rangle\langle a,x+x_0\rangle$ and recall that
$\mathcal{L}_x = |h_x(a) - w|^p - |w|^p$. Since $w$ is a symmetric random variable, it is distributed
as $\varepsilon|w|$, where $\varepsilon$ is a symmetric $\{-1,1\}$-valued random variable, independent
of $w$ and of $a$. Therefore,

$$\begin{aligned}
\mathbb{E}\mathcal{L}_x &= \mathbb{E}_{a\times w}\mathbb{E}_\varepsilon\left(|h_x(a) - \varepsilon|w||^p - |w|^p\right) \\
&= \mathbb{E}_{a\times w}\Big(\frac{1}{2}\big||w| - h_x(a)\big|^p + \frac{1}{2}\big||w| + h_x(a)\big|^p - |w|^p\Big).
\end{aligned} \tag{4.14}$$

It is well known (see, e.g., [30], [24]) that if $1 < p \le 2$, then for every $c,d \in \mathbb{R}$,

$$\frac{1}{2}\left(|c+d|^p + |c-d|^p\right) \ge \left(c^2 + (p-1)d^2\right)^{p/2}. \tag{4.15}$$

Observe that the function $f(t) = (t^2 + (p-1)d^2)^{p/2} - t^p$ is increasing for
$t \ge 0$ and that $f(0) = (p-1)^{p/2}d^p$. Hence, for every $c,d$,

$$\left(c^2 + (p-1)d^2\right)^{p/2} - c^p \ge (p-1)^{p/2}d^p.$$

Taking $c = |w|$ and $d = |h_x(a)|$,

$$\mathbb{E}\mathcal{L}_x \ge (p-1)^{p/2}\,\mathbb{E}|h_x(a)|^p = (p-1)^{p/2}\,\mathbb{E}|\langle a,x-x_0\rangle\langle a,x+x_0\rangle|^p.$$

By the definition of $\kappa(s,t)$, for every $s,t$ and $p \ge 1$,

$$\left(\mathbb{E}|\langle a,t\rangle\langle a,s\rangle|^p\right)^{1/p} \ge \mathbb{E}|\langle a,t\rangle\langle a,s\rangle| \ge \kappa_T\|t\|_2\|s\|_2 \ge \kappa\|t\|_2\|s\|_2.$$

Therefore,

$$\mathbb{E}\mathcal{L}_x \ge \kappa^p(p-1)^{p/2}\,\|x - x_0\|_2^p\,\|x + x_0\|_2^p.$$

Combining this lower bound with (4.13), and recalling that $p = 1 + 1/\log\beta_N$
completes the proof of the theorem.
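The two-point inequality (4.15) is the source of the convexity used in the lower bound, and it can be spot-checked numerically. The sketch below evaluates both sides on a grid; the grid range and the sampled values of $p$ are arbitrary assumptions, and a grid check illustrates the inequality without proving it.

```python
import numpy as np

def lhs(c, d, p):
    """Left-hand side of (4.15): the average of |c + d|^p and |c - d|^p."""
    return 0.5 * (np.abs(c + d) ** p + np.abs(c - d) ** p)

def rhs(c, d, p):
    """Right-hand side of (4.15): (c^2 + (p - 1) d^2)^(p/2)."""
    return (c ** 2 + (p - 1.0) * d ** 2) ** (p / 2.0)

grid = np.linspace(-3.0, 3.0, 61)
c, d = np.meshgrid(grid, grid)
# smallest value of lhs - rhs over the grid, for several exponents in (1, 2]
gaps = {p: float(np.min(lhs(c, d, p) - rhs(c, d, p))) for p in (1.1, 1.5, 1.9, 2.0)}
```

For $p = 2$ the two sides agree identically, so the corresponding gap is zero up to rounding; for $p < 2$ the gap is nonnegative on the grid.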

4.4 Examples 

Let us present some of the examples seen in Section 3.3 in the noisy setting.
Other examples may be obtained with similar ease.

In order to apply the results of Theorem 4.8 one has to determine
$d(T)$, $\ell(T)$, $\mathbb{E}|w|^p$, $\|w\|_{\psi_1}$, $\|w\|_{\psi_2}$ and $\||w|^p\|_{\psi_1}$.

In all the examples below we will assume that $T \subset S^{n-1}$ and so $d(T) = 1$.
Since $w$ is symmetric and $L$-subgaussian, then $\|w\|_{\psi_1} \lesssim \|w\|_{\psi_2} \lesssim L\sigma$, where
$\sigma$ is the noise standard deviation. Also, since $1 < p \le 2$,
$\||w|^p\|_{\psi_1} = \|w\|_{\psi_p}^p \lesssim (L\sigma)^p$.
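4.4.1 Entire Space $T = \mathbb{R}^n$

If $T = S^{n-1}$ then $\ell(T) \sim \sqrt{n}$, implying that

$$Q_{T,N} \lesssim \sqrt{\frac{n}{N}} + \frac{n}{N}$$

and, for the regime of $N$ we are interested in,

$$Q_{T,N,w} \lesssim_L \sqrt{\frac{n}{N}} + \frac{\sigma^p}{\sqrt{N}}.$$

In addition,

$$\beta_N \sim (\sigma + 1)\log N + n.$$

Suppose that $n \ge (\sigma + 1)\log N$. Then, by Theorem 4.8,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa} \sqrt{\log\beta_N}\,\Big(\frac{n}{N}\Big)^{1/2 - c/\log\beta_N}, \tag{4.16}$$

where $c$ is an absolute constant. If $N \le n^\gamma$ for $\gamma \ge c_1 > 1$, then

$$\Big(\frac{n}{N}\Big)^{-c/\log\beta_N} \le n^{c(\gamma-1)/\log\beta_N} \sim \beta_N^{c(\gamma-1)/\log\beta_N} = C^{\gamma-1},$$

for a suitable absolute constant $C$. Therefore, with this choice of $N$,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa,\gamma} \sqrt{\log n}\,\Big(\frac{n}{N}\Big)^{1/2},$$

and it suffices to take $N \gtrsim_{L,\gamma,\kappa,\varepsilon,\delta} n\log n$ to ensure that
$\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2 \le \varepsilon$ with probability at least $1 - \delta$.

Corollary 4.14 For every $L \ge 1$ and $\kappa > 0$ there exist constants $c_1$, $c_2$, $c_3$
that depend only on $L$ and $\kappa$ and for which the following holds. If $\mu$, $a$ and
$w$ are as above, $T = S^{n-1}$ and $N \le n^\gamma$ for $\gamma \ge c_1 > 1$, then for $u \ge c_2$ with
probability at least $1 - 2\exp(-c_3u^{1/3})$,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa,\gamma,u} \sqrt{\log n}\,\Big(\frac{n}{N}\Big)^{1/2}.$$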

4.4.2 Sparse Vectors 

We already treated the case of sparse vectors in Corollary 4.9. The block-sparse
setting can be treated in a similar manner, leading to the following
corollary.

Corollary 4.15 For every $L \ge 1$ and $\kappa > 0$ there exist constants $c_1$, $c_2$, $c_3$
that depend only on $L$ and $\kappa$ and for which the following holds. If $a$ and
$w$ are as above, $T$ is the set of $k$-block sparse vectors of length $d$ on the
sphere and $N \le (k\log(en/dk) + dk)^\gamma$ for $\gamma \ge c_1 > 1$, then for $u \ge c_2$ with
probability at least $1 - 2\exp(-c_3u^{1/3})$,

$$\|\hat{x} - x_0\|_2\,\|\hat{x} + x_0\|_2 \lesssim_{L,\kappa,\gamma,u} \sqrt{\log(k\log(en/dk) + dk)}\,\Big(\frac{k\log(en/dk) + dk}{N}\Big)^{1/2}.$$

In particular, if $N \gtrsim_{L,\gamma,\varepsilon,\delta} (k\log(en/kd) + dk)(\log k + \log d)$ then
$\|\hat{x} - x_0\|_2\|\hat{x} + x_0\|_2 \le \varepsilon$ with probability at least $1 - \delta$.

5 Connection with Results on Linear Estimation 

It should come as no surprise that the methods used here are very similar in 
nature to the analogous "linear questions" . Both stability and noisy recovery 
are well understood in the linear case, and in a sharp way, as we will explain 
below. 

First, consider the question of stability. Suppose that the measurements
are given by $y = Ax$ for some $N \times n$ matrix $A$. In the linear setting, a
natural notion of stability in a set $T \subset \mathbb{R}^n$ is that for all $s,t \in T$,

$$\|At - As\|_2 \ge C\|t - s\|_2, \tag{5.1}$$

where $C$ is a constant.

Note that here the $\ell_2$ norm is used in the left hand side, rather than the
$\ell_1$ norm. An $\ell_2$ stability result is superior to an $\ell_1$ estimate, simply because
the $\ell_2$ norm is smaller. It is therefore natural to compare an $\ell_2$ stability result
in the linear case to the $\ell_1$ stability result for quadratic measurements we
established.

Stability in a set $T$ for a random ensemble depends on the way in which
a typical operator acts on the set

$$T_- = \Big\{\frac{t-s}{\|t-s\|_2} : t \neq s,\ s,t \in T\Big\} \subset S^{n-1}.$$

Indeed, because $a$ is distributed according to an isotropic measure, for every
$z \in S^{n-1}$, $\mathbb{E}|\langle a,z\rangle|^2 = 1$. Thus, stability on $T$ is equivalent to an estimate
on

$$\sup_{z\in T_-}\Big|\frac{1}{N}\sum_{i=1}^N|\langle a_i,z\rangle|^2 - 1\Big|, \tag{5.2}$$

which is strictly smaller than 1.

With this in mind, the stability constant in $\mathbb{R}^n$ is a lower bound on the
smallest singular value of a typical operator from the given random ensemble.

The study of the process (|5.2p . both for T_ = S"™" 1 and for an arbitrary 
subset of the sphere has been extensive in recent years. A good starting 
point for the interested reader would be |25t HTj for subgaussian ensembles, 
\4:0\ [32] for log-concave ensembles, and [44 \ [33| IM] for ensembles with heavy 
tails (though this does not begin to cover the extensive literature on the 
topic). 

In the context of this paper, subgaussian ensembles, the best estimate
on (5.2) follows from Theorem 2.8 applied to the class $F = H = \{\langle v,\cdot\rangle : v \in T_-\}$.
Moreover, in [31] it was shown that under very mild assumptions on
the set $T_-$, the estimate is sharp.

Theorem 5.1 For every $L \ge 1$ there exist constants $c_1$, $c_2$ and $c_3$ that depend
only on $L$ for which the following holds. If $T \subset \mathbb{R}^n$ and $a$ is distributed
according to an isotropic, $L$-subgaussian measure, then for $u \ge c_1$, with
probability at least $1 - 2\exp(-c_2u^2\ell^2(T_-))$, for every $s,t \in T$,

$$\|As - At\|_2 \ge \|s - t\|_2/\sqrt{2},$$

provided that $N \ge c_3u^3\ell^2(T_-)$.

Proof. Since

$$\|As - At\|_2^2 = \|s - t\|_2^2\cdot\frac{1}{N}\sum_{i=1}^N|\langle a_i,z\rangle|^2$$

for $z = (t-s)/\|t-s\|_2 \in T_-$, then setting $z_{s,t} = N^{-1}\sum_{i=1}^N|\langle a_i,z\rangle|^2$, it
suffices to bound $\inf_{s,t\in T,\,s\neq t}z_{s,t}$ from below. Since $a$ is isotropic,
$\mathbb{E}z_{s,t} = \mathbb{E}|\langle a,z\rangle|^2 = 1$. Applying Theorem 2.8 for $N \gtrsim_L u^6\ell^2(T_-)$ and recalling that
$T_- \subset S^{n-1}$, it follows that with probability at least $1 - 2\exp(-cu^2\ell^2(T_-))$,

$$\sup_{z\in T_-}\Big|\frac{1}{N}\sum_{i=1}^N|\langle a_i,z\rangle|^2 - 1\Big| \le \frac{1}{2}.$$

On that event, for every $s \neq t$,

$$\|As - At\|_2^2 = \|A(s-t)\|_2^2 \ge \|s - t\|_2^2/2,$$

as claimed.
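For $T = \mathbb{R}^n$ the stability constant in Theorem 5.1 is a lower bound on the smallest singular value of $A$, which is easy to observe in simulation. The sketch below assumes a Gaussian ensemble and the normalization $A = N^{-1/2}(\langle a_i,\cdot\rangle)_{i=1}^N$, which is the scaling implied by the identity at the start of the proof; the values of `n`, `N` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)                      # arbitrary seed
n, N = 20, 1000                                     # N is a large multiple of n
A = rng.standard_normal((N, n)) / np.sqrt(N)        # rows a_i / sqrt(N), a_i standard Gaussian

# singular values are returned in decreasing order, so the last one is the smallest
smallest = float(np.linalg.svd(A, compute_uv=False)[-1])

s, t = rng.standard_normal(n), rng.standard_normal(n)
ratio = float(np.linalg.norm(A @ s - A @ t) / np.linalg.norm(s - t))
```

With $N$ this much larger than $n$, `smallest` is close to $1 - \sqrt{n/N}$, comfortably above $1/\sqrt{2}$, and the ratio for any pair $s \neq t$ is at least `smallest`.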
Observe that the same complexity parameter appears in the linear case as
in the "quadratic" stability result - the gaussian complexity of a "projection"
of $T - T$ onto the sphere (and, of course, the $T + T$ component does not
appear). In all the examples we presented in this note, $T + T$ and $T - T$ have
essentially (or exactly) the same complexity, and thus the stability estimates
in the linear case coincide with quadratic bounds, as will be the case for any
$T \subset \mathbb{R}^n$ with a similar property. Thus, in these cases, there is no harm in
requiring stability over quadratic measurements rather than with respect to
linear ones.

The noisy recovery problem in the linear case is much simpler, since the 
resulting empirical process is well behaved even if one uses the squared loss 
functional. The advantage in considering the squared loss functional is that 
one has the benefit of the required convexity "for free". With this objective, 
noisy recovery becomes a linear regression problem in $\mathbb{R}^n$, indexed by $T$.
This is a well studied topic in statistics. We refer the reader to [26] for 
relatively recent results related to this question. 

The best results to-date on linear regression that take into account the 
complexity of the indexing set T can be found in [31] . One may show 
that these estimates are sharp under very mild assumptions on T, and it 
turns out that these assumptions are satisfied in the examples that were 
presented here. Since our bounds in the "quadratic" case are of the same 
order of magnitude as in the easier, linear case, and since these bounds are 
optimal in the linear case, it is reasonable to expect that they are optimal 
in the quadratic scenario as well. Unfortunately, the methods required to 
prove this optimality are rather involved, and we will not explore this issue 
here. Rather we refer the reader to [31], in which the linear case is explored.